Introduction:

Survival Analysis is a Univariate Statistical technique in determining failure times(response variable) with lots of covariates. This technique is an extension of Logistic Regression, because there is a possibility of censoring happening in the data. Censoring is the condition where the failure time was failed to be observed at the end of the defined period of observation. There are two different types of censoring, depending on how the investigator define the period of observation. In type 1 censoring, the investigator define period of observation as a fixed value. In type 2 censoring, the investigator define period of observation as a random variable after d number of failures. The basic goal of survival analysis is to estimate and to interpret survivor function or hazard function, to compare hazard and/or survivor function, and to assess the relationship between covariates to survival time.

The main goal of this paper is to inform the reader about some Basic Survival Analysis techniques and to apply that technique in a small dataset setting.

Theory:

Distribution of Failure Times

Distribution of Failure Times is an important aspect of survival analysis. In order to correctly make an inference about survival data, it is necessary to obtain the exact distribution in which the survival data belong.

Common Vocabulary used in Distribution of Failure Times:

1.Survivor Function Survivor Function S(t) is the probability that a person survive longer than the specified time. The Survival Function is fundamental to Survival Analysis. This is often expressed by Kaplan Meier Curve. This function can be thought as the complement of cumulative density function.

\(S(t) = P(T>t) = 1- P(T \le t)\)

where T is RV

t is specified period of observation.

2. Probability Density Function Since most of the survival analysis case, we deal with continuous distribution, we can derive Probability Density Function \(f_{t}(t)\) from the survivor function.

\[f_{t}(t)= -S'(t)= \lim_{\delta \to 0^+} P(t \le T < t+\delta)\] to recover the survivor function, it is always possible to take the integral of probability density function.

3.Hazard Function Hazard Function denoted as \(h(t)\) is defined as instantaneous potential per unit time for the event to occur given that the individual has survived up to time t. In contrast to the Survivor Function, hazard function focuses on the event of failing. Depending on the Failure Event that was defined, on average the higher hazard function, the worse the impact on the survival. Since Hazard function is defined as a rate rather than probability, the values of hazard function range between zero and infinity.

According to Cox and Oakes, knowing the survivor function is sufficient enough to be able to derive Hazard Function.

\[h(t) = \lim_{\delta \to 0} P(t\le T < t+\delta|T>t)/ \delta\]

by definition of conditional probability.

\[h(t) = \lim_{\delta \to 0} P((t\le T < t+\delta) n (T > t))/ \delta *P(T>t)\]

\[h(t) = \lim_{\delta \to 0} P(t\le T < t+\delta)/\delta)/P(T>t)\]

By definition, hazard function is just pdf divided by survivor function.

\[h(t) = f_{t}(t) / S(t)\]

where f(t) is defined as Probability density function, and S(t) is defined as survivor function.

These properties above will be useful in assessing distributional form of survival data. Look at appendix 2.2 of Cox and Oakes for more detail.

Parametric Statistical Analysis:

From Survivor Function, and Hazard Function, it is possible to recover the survival distribution and gave some criteria for the appropriate choice of family.

Suppose that the specific family is selected and so the distribution is known up to a vector parameter of big omega a single sample of failure time.

Often, big omega transpose is written as (small omega transpose, and lambda transpose) where small omega transpose represent important parameter of survival distribution, and lambda transpose is just a noise parameter.

There are some theoretical aspect of Parametric Statistical Analysis that would be discussed in this part of the paper.

The Likelihood Function Survival Distribution has two components in it. it has continuous distribution until the period of observation and the contribution of the censored. That is to say that if the event happen to the subject before the period of observation ends, that particular subject only contributes to the continuous distribution. However, if the subject survived longer than the period of observation, that particular subject contributes to both of the continuous distribution and the censored distribution.

the full likelihood for n-independent subjects, indexed by i

\[lik = \prod_{u} {f(ti;\phi)} * \prod_{c} {S(ci;\phi)}\]

usually, it is easier to take the log transformation of likelihood, because the multiplication will become summation.

\[l= \sum_{u} \log{f(ti;\phi)} + \sum_{c} \log{f(ti;\phi)}\]

to get the maximum likelihood of phi, just differentiate the log-likelihood with respect to phi, and set it equal to 0.

The Major difference between Parametric Survival Model and the Proportional Hazard model is the distribution for survival time. In the Parametric Survival Model the distribution for survival time is known, and it follows certain distribution with unknown parameter. The proportional Hazard model that will be discussed in the subsequent part of this paper, does not have distribution for survival time. In fact the baseline model of proportional hazard model remains unknown.

Kaplan Meier Survival Curve

Kaplan Meier method is a way to estimate and graph survival curve using product limit formula.

General KM formula: \(S(t_{f-1}) * P(T>T_{f}|T \ge t_{f})\)

where \(S (t_{f-1}) = \prod_{i=1}^{f-1} P(T>t_{f}|T\ge t_{f})\)

General formula for Kaplan Meier gives the probability of surviving past the previous failure times multiplied by the conditional probability of surviving past failure time, given survival function to at least that failure time.

The formula above can be proven using basic rule of conditional probability.

The main advantage of Kaplan Meier survival curve is to visually see the impact of treatment as oppose to the control group. In addition to that, if the experiment consists of multiple treatment, the Experimenter may be able to visually see which treatment is the best for the patients.

Log Rank Test:

Log Rank Test is used to formally test that the true population survival curve is different. This is very similar to testing chi-square test of homogeneity. That is to say if the true population of survival curve is significantly different, we would expect that the proportion of survival group in treatment to be different than the proportion of survival group in control group.

When two groups are present for comparison, it is approximately distributed at chi-square with (2-1) df. The chisquare distribution with 1 df, is just a (standard normal)^2 distribution.

therefore \(\mbox {log-rank test} = (O_2- E_2)^2 / var(O_2 - E_2)\)

When more than two groups are present for comparison, it is approximately distributed at chi-square (G -1) df. Usually computer software is used to do exact log rank test calculation. However, it can also be approximated using \(\sum_{ij} (O_{ij}-E_{ij})^2/ E_{ij}\) and compare the result with chi-square table to get the p-value.

If the p-value is less than specified \(\alpha\), then we have enough evidence that the population survival curve are different. However, if the p-value is greater than the specified \(\alpha\), then we do not have enough evidence to show that population survival curve are different.

There are different variations of Test Statistic to determine population survival curve, depending on the weight of Test Statistic. This determination of weight, should come from prior information on which part of survival curve is more important. This is where Bayesian point of view is being used in Survival Analysis.

Here is the example of log rank test being used in small dataset:

Analysis on small dataset of survival data done in R

leukemia_patient=read.csv("leukemia.csv",header=T)
leukemia_patient
##    Time event Treatment
## 1     6     1      6-MP
## 2     6     1      6-MP
## 3     6     1      6-MP
## 4     6     0      6-MP
## 5     7     1      6-MP
## 6     9     0      6-MP
## 7    10     1      6-MP
## 8    10     0      6-MP
## 9    11     0      6-MP
## 10   13     1      6-MP
## 11   16     1      6-MP
## 12   17     0      6-MP
## 13   19     0      6-MP
## 14   20     0      6-MP
## 15   22     1      6-MP
## 16   23     1      6-MP
## 17   25     0      6-MP
## 18   32     0      6-MP
## 19   32     0      6-MP
## 20   34     0      6-MP
## 21   35     0      6-MP
## 22    1     1   Placebo
## 23    1     1   Placebo
## 24    2     1   Placebo
## 25    2     1   Placebo
## 26    3     1   Placebo
## 27    4     1   Placebo
## 28    4     1   Placebo
## 29    5     1   Placebo
## 30    5     1   Placebo
## 31    8     1   Placebo
## 32    8     1   Placebo
## 33    8     1   Placebo
## 34    8     1   Placebo
## 35   11     1   Placebo
## 36   11     1   Placebo
## 37   12     1   Placebo
## 38   12     1   Placebo
## 39   15     1   Placebo
## 40   17     1   Placebo
## 41   22     1   Placebo
## 42   23     1   Placebo
library(splines)
library(survival)
attach(leukemia_patient)
summary(Treatment)
##    6-MP Placebo 
##      21      21

Log-Rank-Test

log_rank=survdiff(Surv(leukemia_patient$Time,leukemia_patient$event)~leukemia_patient$Treatment,rho=1)
log_rank
## Call:
## survdiff(formula = Surv(leukemia_patient$Time, leukemia_patient$event) ~ 
##     leukemia_patient$Treatment, rho = 1)
## 
##                                     N Observed Expected (O-E)^2/E
## leukemia_patient$Treatment=6-MP    21     5.12    12.00      3.94
## leukemia_patient$Treatment=Placebo 21    14.55     7.68      6.16
##                                    (O-E)^2/V
## leukemia_patient$Treatment=6-MP         14.5
## leukemia_patient$Treatment=Placebo      14.5
## 
##  Chisq= 14.5  on 1 degrees of freedom, p= 0.000143

The P-value obtained from this log rank test is significantly small, indicating that the survival curve is significantly different between 6-MP group and Placebo group.

Cox-Proportional Hazard Model

Cox-Proportional Hazard Model is o semiparametric hazard model. This proportional hazard model has two major components in the equation

\(\mbox h(X,T) = h_0 (t) * exp (\sum_i(\beta i*Xi))\) Where \(h_o(t)\) is the baseline hazard, and \(exp (\sum_i(\beta i*Xi))\) is the exponential term. The major advantage of Cox-Proportional hazard model is that the baseline hazard does not need to be specified/semiparametric. In addition to that, Cox-Proportional Hazard Model has the robust estimate to the baseline hazard. In addition to that the exponential term of this Cox-Proportional Hazard Model will ensure that the Hazard function will always be non-negative.

One of the assumption of Cox-Proportional Hazard model is the time-independent covariates, or the covariates must not be changing over time. The Covariates like smoking status will easily failed this assumption, because people can change their smoking habit. For this reason, there are extension of this cox-proportional hazard model using joint probability technique.

Going back to Cox-Proportional Hazard Model. Usually, there are Three Statistical Objectives in Cox-Proportional Hazard model: 1. Test for significance of the effect. 2. Point estimate of the effect 3. Convidence-Interval of the effect.

Testing for the Significance of the Effect:

There are two different Test Statistic used to test for the significance of the effect. The first test is Z-Statistic or commonly known as Wald Statistic. In this Test Statistic, Z was obtained from dividing coefficient with the corresponding Standard Error. The p-value from this Test Statistic was obtained from Z-table. If the p-value < alpha, then the effect of this particular covariate is significant.

Another Test Statistic that is commonly used to test for the significance of the effect is likelihood ratio test. This test makes use of the log-likelihood statistic obtained and multiply is by -2, and compared it with chi-square distribution with group-1 degree of freedom. If the p-value is less than alpha, then the effect of this particular covariate is significant.

Usually, Likelihood Ratio has better Statistical property, due to uniformly most powerful test by Neyman-Pearson Lemma. Therefore, when in doubt, Statistician usually used Likelihood Ratio Test.

Point Estimate of the Effect:

In the experimental Design point of view, the point estimate of the effect is usually correspond to the least square estimate of the coefficient. Under normal distribution, the least square estimate was obtained from taking maximum-likelihood estimation of the coefficient. In Survival Analysis point of view, the point estimate of the effect is usually correspond to the hazard ratio. Similar to Odds Ratio, Hazard Ratio of 1 indicate that there is no different between treatment group and control group. Hazard Ratio that is greater than 1, indicate that the Treatment Group has more hazard than the Control Group. Hazard Ratio that is less than 1, indicate that the treatment group has less hazard than the control group.

Confidence Interval of the effect:

Confidence Interval of the effect is usually one of the main statistical objectives in cox-proportional hazard model. Knowing Confidence Interval of the effect can also test whether or not that covariates is significant in determining survival time. Confidence Interval of the effect varies depending on whether or not the interaction between covariates is significant. If the interaction of covariates is not significant, then it is easier to model the 95% confidence interval.

\[\mbox {95% CI for HR} = exp[\beta_i\pm 1.96\sqrt(var(\beta_i))]\]

the sd(\(\beta_i\)) = \(S\beta_i\)

if The interaction between covariates is significant, then the 95% CI of HR is slightly changes.

It is necessary to defined a term \(l= \beta_i +\sum_i (\beta_i*Wi)\) where Wi correspond to the significant interaction between covariates.

\[\mbox {95% CI of HR} = exp[l \pm1.96\sqrt(var(\beta_i))]\]

Usually, the calculation of Confidence Interval of the effect is done in Statistical Software such as R, or SAS.

An attempt to use Cox-Proportional Hazard Model to determine the effect of treatment

response_variable=Surv(leukemia_patient$Time,leukemia_patient$event)
surv_model=coxph(response_variable~Treatment)
summary(surv_model)
## Call:
## coxph(formula = response_variable ~ Treatment)
## 
##   n= 42, number of events= 30 
## 
##                   coef exp(coef) se(coef)    z Pr(>|z|)    
## TreatmentPlacebo 1.572     4.817    0.412 3.81  0.00014 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                  exp(coef) exp(-coef) lower .95 upper .95
## TreatmentPlacebo      4.82      0.208      2.15      10.8
## 
## Concordance= 0.69  (se = 0.053 )
## Rsquare= 0.322   (max possible= 0.988 )
## Likelihood ratio test= 16.4  on 1 df,   p=5.26e-05
## Wald test            = 14.5  on 1 df,   p=0.000138
## Score (logrank) test = 17.2  on 1 df,   p=3.28e-05

Analysis part II

surv_model_2=survreg(response_variable~leukemia_patient$Treatment,dist="exponential")
summary(surv_model_2)
## 
## Call:
## survreg(formula = response_variable ~ leukemia_patient$Treatment, 
##     dist = "exponential")
##                                   Value Std. Error     z        p
## (Intercept)                        3.69      0.333 11.06 2.00e-28
## leukemia_patient$TreatmentPlacebo -1.53      0.398 -3.83 1.27e-04
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= -108.5   Loglik(intercept only)= -116.8
##  Chisq= 16.49 on 1 degrees of freedom, p= 4.9e-05 
## Number of Newton-Raphson Iterations: 4 
## n= 42
surv_model_3=survfit(response_variable~leukemia_patient$Treatment)
surv_model_3
## Call: survfit(formula = response_variable ~ leukemia_patient$Treatment)
## 
##                                    records n.max n.start events median
## leukemia_patient$Treatment=6-MP         21    21      21      9     23
## leukemia_patient$Treatment=Placebo      21    21      21     21      8
##                                    0.95LCL 0.95UCL
## leukemia_patient$Treatment=6-MP         16      NA
## leukemia_patient$Treatment=Placebo       4      12

Analysis part III

surv_new= expand.grid(Treatment=c("6-MP","Placebo"))
t(surv_new)
##           [,1]   [,2]     
## Treatment "6-MP" "Placebo"
surv_fit_1=survfit(surv_model,surv_new)
surv_fit_1
## Call: survfit(formula = surv_model, newdata = surv_new)
## 
##   records n.max n.start events median 0.95LCL 0.95UCL
## 1      42    42      42     30     23      16      NA
## 2      42    42      42     30      8       5      12
summary(surv_fit_1)
## Call: survfit(formula = surv_model, newdata = surv_new)
## 
##  time n.risk n.event survival1 survival2
##     1     42       2     0.983    0.9227
##     2     40       2     0.966    0.8453
##     3     38       1     0.956    0.8067
##     4     37       2     0.937    0.7293
##     5     35       2     0.915    0.6520
##     6     33       3     0.880    0.5415
##     7     29       1     0.869    0.5077
##     8     28       4     0.818    0.3794
##    10     23       1     0.803    0.3468
##    11     21       2     0.771    0.2849
##    12     18       2     0.731    0.2216
##    13     16       1     0.708    0.1899
##    15     15       1     0.685    0.1620
##    16     14       1     0.659    0.1341
##    17     13       1     0.633    0.1101
##    22      9       2     0.554    0.0580
##    23      7       2     0.445    0.0202
plotstuff=expand.grid(line=c("solid"), colour=c("red","blue"),stringsAsFactors=F)
plotstuff
##    line colour
## 1 solid    red
## 2 solid   blue

Evaluating Cox-Proportional Hazard Model

Suppose that an experimenter wants to know whether Cox-Proportional Hazard model. It is possible to do Goodness of fit to test whether or not Cox-Proportional Hazard model is appropriate for the Survival Data Analysis. There are three ways to check the whether the survival data fit with Cox-Proportional Hazard model assumption. The first way is through the Graphical Representation of -ln(-ln(S)) or Kaplan-Meier Curve vs the estimated curve using Cox-Proportional Hazard model. In addition to the Graphical representation, it is also possible to do Goodness of fit test, or extended Cox-model.

Graphical Method:

ln(-ln(S)): This is the first way to check whether or not the covariates in the data violates Cox Proportional Hazard model. If the ln(-ln(S)) curves show non-parallel pattern on certain covariates, then that covariate violates the assumption of Cox-Proportional Hazard model. This approach is very subjective. In addition to that, the parallel ln(-ln(S)) does not necessarily mean that the covariates follow the Cox-Proportional hazard Model assumption.

Here is the example of ln(-ln(S)) plot to assess whether the treatment follows proportional hazard assumption

#try to plot log(-log(survival))
library(splines)
library(survival)
surv_model_3=survfit(response_variable~leukemia_patient$Treatment)
surv_model_3
## Call: survfit(formula = response_variable ~ leukemia_patient$Treatment)
## 
##                                    records n.max n.start events median
## leukemia_patient$Treatment=6-MP         21    21      21      9     23
## leukemia_patient$Treatment=Placebo      21    21      21     21      8
##                                    0.95LCL 0.95UCL
## leukemia_patient$Treatment=6-MP         16      NA
## leukemia_patient$Treatment=Placebo       4      12
surv_model_4=summary(surv_model_3)
surv_model_4$surv
##  [1] 0.85714 0.80672 0.75294 0.69020 0.62745 0.53782 0.44818 0.90476
##  [9] 0.80952 0.76190 0.66667 0.57143 0.38095 0.28571 0.19048 0.14286
## [17] 0.09524 0.04762 0.00000
data_frame=data.frame(surv_model_4$strata,surv_model_4$time,surv_model_4$surv)
names(data_frame)=c("treatment","time","survival")
data_frame[10:15,]
##                             treatment time survival
## 10 leukemia_patient$Treatment=Placebo    3   0.7619
## 11 leukemia_patient$Treatment=Placebo    4   0.6667
## 12 leukemia_patient$Treatment=Placebo    5   0.5714
## 13 leukemia_patient$Treatment=Placebo    8   0.3810
## 14 leukemia_patient$Treatment=Placebo   11   0.2857
## 15 leukemia_patient$Treatment=Placebo   12   0.1905
treatment_1=data_frame[data_frame$treatment=="leukemia_patient$Treatment=6-MP",]
placebo=data_frame[data_frame$treatment=="leukemia_patient$Treatment=Placebo",]
plot(treatment_1$time,log(-log(treatment_1$survival)),xlab="remission time in weeks",ylab="log-log survival",xlim=c(0,40),col="red",type="l",lty="solid",main="log-log curves by treatment")
par(new=T)
plot(placebo$time,log(-log(placebo$survival)),axes=F,xlab="remission time in weeks",ylab="log-log survival",xlim=c(0,40),ylim=c(-2,2),col="brown",type="l",lty="dashed",main="log-log curves by treatment")

plot of chunk unnamed-chunk-6

From this log(-log(S)) it seems that the proportional hazard model assumption for the treatment covariates seems to be violated, because the parallel assumption is violated.

plot observed Kaplan Meier Curve and compare it with survival curve under Cox-PH model

The idea here is to compare the Kaplan Meier Curve and the survival curve measured in Cox-PH model. If the Kaplan Meier Curve is different from the survival Cox-PH model, then the PH assumption is violated.

plot_1=plot(surv_model_3,lty=c("solid","dashed"),col=c("black","blue"),xlab="remission time in days",ylab="survival_probability")
legend("topright",c("6-MP","Placebo"),lty=c("solid","dashed"),col=c("black","blue"))

plot of chunk unnamed-chunk-7

plot_2=plot(surv_fit_1,lty=plotstuff$line,col=plotstuff$colour)
legend("topright",legend=surv_new$Treatment,title="treatment",lty=plotstuff$line,col=plotstuff$col)
## Warning: Name partially matched in data frame

plot of chunk unnamed-chunk-8

In this Case, the survival curve and Kaplan Meier Curve seems to be close, therefore the treatment group complies with PH assumption.

It is Interesting to see that sometimes log-log curves and plot observed vs expected values can give a different result. Since graphical test is very subjective, it is important for the Statistician to develop a formal statistical test.

Goodness of Fit Test:

The Goodness of Fit Test is an objective testing, because it provides a test statistic and p-value for assessing the PH assumption for a given predictor of interest. A number of different test statistic were proposed by different researcher, but the common method of to do Goodness of Fit Test is to take a look at Schoenfeld Residuals. The idea here is that if the PH assumption holds for a particular covariates, then its schoenfeld residual for that covariate will not be related to survival time.

Therefore, it is a way to check the correlation between schoenfeld residual vs survival time. If the correlation is not equal to 0, therefore the PH assumption is violated.

cox.zph(surv_model,transform=rank)
##                      rho  chisq     p
## TreatmentPlacebo -0.0296 0.0258 0.872

Since the P-value of Schoenfeld residual test is high, therefore the treatment covariate complies with the cox-proportional hazard assumption.

Stratified Cox Model

Stratifited Cox Model is a modification of CoxPH model that allows for control by stratification of predictor not satisfying PH assumption. The idea is to run Cox Proportional hazard model on the variables that satisfy Cox PH assumption under the strata of the variables not satisfy Cox PH assumption.

General Stratified Cox Model

Suppose that we have k variables that do not satisfy PH assumption denoted by Z1,Z2,…,Zk and p variables satisfy PH assumption X1,X2,X3,…,Xp. To perform the stratified cox procedure, we define a single Z * from the Z used for stratification. Therefore Z has k * categories, where k * is the total # of combinations of strata.

Hazard Function of General Stratified Cox Model: \(\mbox h_g (t,X) = h_{0g} (t) * exp(\beta_1*X1+\beta_2*X2+...+\beta_p*Xp)\)

In this case, the Z* variable that does not follow PH assumption is not included in the model. Unlike before, the baseline hazard function will be different for each of the strata. i.e \(\mbox h_{01} (t) != h_{02} (t) != h_{03} (t) !=.... != h_{0g} (t)\)

but the \(\beta's\) is the same for every strata. To obtain the estimate of for the \(\beta's\) we maximized the partial likelihood function that was obtained by multiplying likelihood functions together.

No Interaction Assumption of SC

Stratified Cox Model \(h_g(t,X) = h_{og} (t)*exp(\beta_1*X1+\beta_2*X2+....+\beta_p*Xp)\)

In this case, \((\beta_1,\beta_2,....,\beta_p)\) contains the same value for each of the strata.

Interaction model by fitting separate models

\(h_g(t,X)= h_{og} (t)*exp(\beta_{1g}*X1+\beta_{2g}*X2+....+\beta_{ng}*Xn)\) where g = 1 for one classification g=2 for another classification of strata.

note that \((\beta_{1g},\beta_{2g},....,\beta_{ng})\) contains different value for each of the strata variable.

Alternative Interaction Model:

\[ h_g(t,X)= h_{og} (t)*exp(\beta_{1}*X1+\beta_{2}*X2+....+\beta_{g}*Xg+\beta_{11}*(Z1*X1)+\beta_{12}\]

\[*(Z2*X1)+...+\beta_{1g}(Zg*X1)+\beta_{21}*(Z1*X2)+\beta_{22}*(Z2*X2)+....+\beta_{2g}\]

\[*(Zg*X2)+........+\beta{g1}*(Zg*X1)+.....+\beta{gg}*(Zg*Xg)\]

This model is extremely complicated. However, there is some sort of manipulation that can be done to equate both Interaction Model.

The Example of Stratified Cox Model from Chapter 5 Test Question:

Suppose that we are trying to analyze time of days spent by heroin addicts from entry to departure from one of the two methadone clinics, the other two covariates were also taken into consideration(prison record and methadone dose)

#An Attempt to do Statistical Analysis for the Test Question from Chapter 5
addicts=read.table("addicts.txt",header=F)
head(addicts)
##   V1 V2 V3  V4 V5 V6
## 1  1  1  1 428  0 50
## 2  2  1  1 275  1 55
## 3  3  1  1 262  0 55
## 4  4  1  1 183  0 30
## 5  5  1  1 259  1 65
## 6  6  1  1 714  0 55
names(addicts)= c("ID","clinic","surv_status","surv_time","prison","methadone_dose")
library(splines)
library(survival)
response=Surv(addicts$surv_time,addicts$surv_status)
explanatory_var=cbind(addicts$clinic,addicts$prison,addicts$methadone_dose)
#explanatory_var
#no interaction model
addict_cox=coxph(response~addicts$clinic+addicts$prison+addicts$methadone_dose)
#interaction model
addict_cox_2=coxph(response~addicts$clinic+addicts$prison+addicts$methadone_dose+addicts$clinic*addicts$prison+addicts$clinic*addicts$methadone_dose)
addict_cox
## Call:
## coxph(formula = response ~ addicts$clinic + addicts$prison + 
##     addicts$methadone_dose)
## 
## 
##                           coef exp(coef) se(coef)     z       p
## addicts$clinic         -1.0099     0.364  0.21489 -4.70 2.6e-06
## addicts$prison          0.3266     1.386  0.16722  1.95 5.1e-02
## addicts$methadone_dose -0.0354     0.965  0.00638 -5.54 2.9e-08
## 
## Likelihood ratio test=64.6  on 3 df, p=6.23e-14  n= 238, number of events= 150
addict_cox_2
## Call:
## coxph(formula = response ~ addicts$clinic + addicts$prison + 
##     addicts$methadone_dose + addicts$clinic * addicts$prison + 
##     addicts$clinic * addicts$methadone_dose)
## 
## 
##                                          coef exp(coef) se(coef)      z
## addicts$clinic                         0.1796     1.197   0.8933  0.201
## addicts$prison                         1.1924     3.295   0.5414  2.202
## addicts$methadone_dose                -0.0192     0.981   0.0194 -0.990
## addicts$clinic:addicts$prison         -0.7383     0.478   0.4315 -1.711
## addicts$clinic:addicts$methadone_dose -0.0140     0.986   0.0143 -0.974
##                                           p
## addicts$clinic                        0.840
## addicts$prison                        0.028
## addicts$methadone_dose                0.320
## addicts$clinic:addicts$prison         0.087
## addicts$clinic:addicts$methadone_dose 0.330
## 
## Likelihood ratio test=68.2  on 5 df, p=2.45e-13  n= 238, number of events= 150
names(addict_cox_2)
##  [1] "coefficients"      "var"               "loglik"           
##  [4] "score"             "iter"              "linear.predictors"
##  [7] "residuals"         "means"             "concordance"      
## [10] "method"            "n"                 "nevent"           
## [13] "terms"             "assign"            "wald.test"        
## [16] "y"                 "formula"           "call"
#check for violation of proportional hazard
res_addict_cox=cox.zph(addict_cox)
res_addict_cox
##                            rho chisq        p
## addicts$clinic         -0.2578 11.19 0.000824
## addicts$prison         -0.0382  0.22 0.639369
## addicts$methadone_dose  0.0724  0.70 0.402749
## GLOBAL                      NA 12.62 0.005546
plot(res_addict_cox)

plot of chunk unnamed-chunk-10plot of chunk unnamed-chunk-10plot of chunk unnamed-chunk-10

residuals(res_addict_cox, type = "scaledsch")
## NULL
addict_cox_strata=coxph(response~strata(addicts$clinic)+addicts$prison+addicts$methadone_dose)
summary(addict_cox_strata)
## Call:
## coxph(formula = response ~ strata(addicts$clinic) + addicts$prison + 
##     addicts$methadone_dose)
## 
##   n= 238, number of events= 150 
## 
##                            coef exp(coef) se(coef)     z Pr(>|z|)    
## addicts$prison          0.38960   1.47640  0.16893  2.31    0.021 *  
## addicts$methadone_dose -0.03511   0.96549  0.00646 -5.43  5.6e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                        exp(coef) exp(-coef) lower .95 upper .95
## addicts$prison             1.476      0.677     1.060     2.056
## addicts$methadone_dose     0.965      1.036     0.953     0.978
## 
## Concordance= 0.651  (se = 0.034 )
## Rsquare= 0.133   (max possible= 0.994 )
## Likelihood ratio test= 33.9  on 2 df,   p=4.32e-08
## Wald test            = 32.7  on 2 df,   p=8.08e-08
## Score (logrank) test = 33.3  on 2 df,   p=5.77e-08
res_addict_cox_2=cox.zph(addict_cox_strata)
plot(res_addict_cox_2)

plot of chunk unnamed-chunk-10plot of chunk unnamed-chunk-10

#plot the Kaplan Meier survival estimate at specified times.

kmfit_2=survfit(response~addicts$clinic)
summary(kmfit_2,times=c(0,100,200,300,400,500,600,700,800,900,1000))
## Call: survfit(formula = response ~ addicts$clinic)
## 
##                 addicts$clinic=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0    163       0   1.0000  0.0000      1.00000        1.000
##   100    137      20   0.8746  0.0262      0.82467        0.928
##   200    110      20   0.7420  0.0353      0.67601        0.814
##   300     87      20   0.6046  0.0399      0.53120        0.688
##   400     68      14   0.5025  0.0415      0.42741        0.591
##   500     53       9   0.4319  0.0418      0.35719        0.522
##   600     30      16   0.2951  0.0403      0.22570        0.386
##   700     20       8   0.2113  0.0383      0.14818        0.301
##   800     10       8   0.1268  0.0326      0.07660        0.210
##   900      1       7   0.0181  0.0172      0.00283        0.116
## 
##                 addicts$clinic=2 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0     75       0    1.000  0.0000        1.000        1.000
##   100     66       5    0.932  0.0294        0.876        0.991
##   200     58       7    0.832  0.0442        0.750        0.924
##   300     50       7    0.730  0.0530        0.633        0.842
##   400     43       3    0.685  0.0558        0.584        0.804
##   500     39       2    0.653  0.0577        0.549        0.776
##   600     27       1    0.634  0.0590        0.528        0.761
##   700     19       1    0.606  0.0625        0.495        0.742
##   800     11       1    0.575  0.0669        0.457        0.722
##   900      7       1    0.517  0.0812        0.380        0.703
##  1000      3       0    0.517  0.0812        0.380        0.703
plot(kmfit_2,lty=c("solid","dashed"),col=c("black","blue"),xlab="survival times in days",ylab="survival probability")

legend("topright",c("clinic 1","clinic 2"),lty=c("solid","dashed"),col=c("black","blue"))

plot of chunk unnamed-chunk-10

#likelihood ratio test for interaction model
addict_cox_2$loglik
## [1] -705.5 -671.5
summary(addict_cox)
## Call:
## coxph(formula = response ~ addicts$clinic + addicts$prison + 
##     addicts$methadone_dose)
## 
##   n= 238, number of events= 150 
## 
##                            coef exp(coef) se(coef)     z Pr(>|z|)    
## addicts$clinic         -1.00990   0.36426  0.21489 -4.70  2.6e-06 ***
## addicts$prison          0.32655   1.38618  0.16722  1.95    0.051 .  
## addicts$methadone_dose -0.03537   0.96525  0.00638 -5.54  2.9e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                        exp(coef) exp(-coef) lower .95 upper .95
## addicts$clinic             0.364      2.745     0.239     0.555
## addicts$prison             1.386      0.721     0.999     1.924
## addicts$methadone_dose     0.965      1.036     0.953     0.977
## 
## Concordance= 0.665  (se = 0.026 )
## Rsquare= 0.238   (max possible= 0.997 )
## Likelihood ratio test= 64.6  on 3 df,   p=6.23e-14
## Wald test            = 54.1  on 3 df,   p=1.06e-11
## Score (logrank) test = 56.3  on 3 df,   p=3.6e-12
LRT=(-2)*(addict_cox$loglik[2]-addict_cox_2$loglik[2])
p_value= 1-pchisq(LRT,2)
p_value
## [1] 0.1638

From fitting the plot of res_addict_cox and the output of res_addict_cox, it is evident that clinic show a curve pattern which violates the PH assumption. Therefore one way to analyze this data is to stratify the clinic variable. The main drawback from using Stratified cox procedure is the inability for the experimenter to get the main effect of clinic.

From the adjusted survival curves, it is evident that the Kaplan Meier estimate does not differ by a lot for the first 300 days, and clinic 2 starts to have a better survival curves after 300 days.

Non Interaction model hazard function: \[h(t,X)=hog(t)*exp(\beta_1*prison+\beta_2*dose)\]

This is a no interaction model because the beta’s output is the same for both clinic 1 and clinic 2 group.

From addict_cox_strata output:

point estimate of prison adjusted on clinic and dose was hazard ratio of 1.476. The 95% interval estimate was obtained at (1.059-2.054)

To check whether or not non-interaction model is satisfied. It is possible to likelihood ratio test for the interaction model-non-interaction model and compare it with the chi-square distribution with 2 degree of freedom, because of extra 2 variables in interaction model.

From the likelihood ratio test, the P-value was obtained at around 0.16, therefore the non-interaction model is sufficient to analyze this data.

Extended Cox Model

Review from our Cox PH model hazard function, the exponential part of the hazard function does not depend on any function of time. However, there are cases where time-dependent covariates are considered in the model, and the assumption of Cox PH model is no longer valid. There are two options when PH assumption is not satisfied the Stratified Cox Model, and extended cox model.

Time-Dependent Variable

Time-dependent variable is defined as any variable whose value for a given subject may differ over time. There are two common type of Time-dependent variable, internal variable and ancillary variable. Internal variable is a variable that changes in value due to characteristics or behaviour of the individual. For example, the obesity level for an individual is mostly affected by personal eating habit and the frequency of exercise. On the other hand, ancillary variable is a variable that changes in value due to external circumstances. For example, the employment rate of a province in any given time is mostly affected by general economic circumstances.

In the presence of Time-Dependent variable in cox-model, it is important to define a function of time called g(t) which is a dummy variable to take value of 1 if T is greater than or equal to specified value of t, 0 if T is less than or equal to specified value of t. This function of time is commonly referred as a “heaviside” function.

The Extended Cox Model for Time-Dependent Variables

\[ h(t,X(t))= h_o (t)*exp((\sum_i \beta_i X_i)+(\sum_j \delta_j*X_j(t)))\]

where \(X_i\) are the time-independent covariates and \(X_j(t)\) are time-dependent covariates.

The regression coefficients are estimated using maximum likelihood procedure. However, the computation of this maximum likelihood are more complicated in extended cox model, because the risk sets used to form likelihood function are more complicated with time-dependent variables.

An important assumption of this model is the effect of a time-dependent variables on survival depends on the value of this variable at that particular time t, and it is not the same for earlier or later time.

Note that at any given time, the hazard function will only provide one coefficient for each time-dependent variable in the model. Therefore the hazard ratio is unique at that particular time t only.

lag-time effect

The hazard function can also be modified to adjust for “lag-time effect” to illustrate the idea of lag-time effect, suppose that the probability of surviving heart-attack depends on the probability of surviving heart-attack from the previous week. therefore to model this assumption: \[h(t,X(t)) = h_o (t)*exp(\delta*survival(t-1))\]

General lag time extended model: \[h(t,X(t)) = h_o(t)*exp((\sum_i \beta_i*Xi)+(\sum_j \delta_j*Xj(t-Lj))\]

where Lj is the lag time specified for the model.

The Hazard Ratio Formula for the Extended Cox Model

To prove that PH assumption is not valid for the extended Cox Model, we can simply take the estimated hazard ratio of two individuals having both time dependent and time-independent covariates.

HR = h(t,X*(t))/h(t,X(t)) The baseline for hazard ratio cancelled out, therefore \[HR = exp((\sum_i \beta_i*(Xi*-Xi))+(\sum_j \delta_j*(Xj*(t)-Xj(t))))\]

suppose that Xi* is the exposed group and equal to 1, and Xi is the control group and equal to 0

\[HR = exp(\sum_i \beta_i+\sum_j \delta_j*t)\] therefore hazard ratio increases as time increases and it violates the proportional hazard model.

Assessing Time-Independent Variables That Do Not Satisfy the PH Assumption

In the previous part of the paper, we discussed the graphical methods and Goodness of Fit test. There is another way to check PH assumption using extended Cox model.

\[h(t,X(t)) = h_o (t)*exp((\sum_i \beta_i*Xi)+\sum_j \delta_j X_j g_j(t)))\]

The simplest form of function of time \(g_j(t)\) is simply 0 at any time. This indicates that we do not have any time-dependent covariates.

Another way to handle this is to set a one variable at a time function on a particular time-independent covariate. \(X_l\) only has \(g_l(t)=t\) and 0 o/w

Therefore the Hazard function can be simplified as \[h(t,X(t))=h_o (t)*exp((\sum_i \beta_i*Xi)+(\delta_l*X_l))\]

Another way to handle this survival data is to set up Heaviside function of time g(t)=1 if \(t\ge to\) and 0 if \(t<to\) this method will be discussed in detail later.

The idea here is to test the null hypothesis that Ho:\(\delta_1=\delta_2=....=0\) vs Ha: there is at least one \(\delta\) which is not equal to 0.

Under Ho, the model reduced to Proportional Hazard model, and we can test the Proportional Hazard model using Likelihood ratio test ~ chisquare p where p is the number of delta being set equal to 0.

Back to Heaviside function, the implication of setting the function of time = 0 if t < to implies that there

is a constant hazard ratio for different type of interval. If t < to, the hazard ratio will contain only

exp\((\beta)\). If \(t \ge to\) then, the hazard ratio will contain exp\((\beta+\delta)\)

Therefore using one heaviside function will gives two hazard ratio values each value being contant over a

fixed time interval.

It is possible to set more than one heaviside function, but the main effect of the covariates may not be included in the model.

Application of Extended Cox Model to Remission Survival Data

Suppose that the 42 leukemia patients, half of whom receive a new therapy and half of whom receive standard

therapy. The exposure variable of interest is the treatment status, and two other variables for control

is for log-white blood cell count and sex. The whole purpose of this exercise is to check the

PH assumption using extended cox model, and do some comparison with stratified cox procedure.

library(splines)
library(survival)
remission=read.table("anderson.txt",header=F)
var_name=c("surv","relapse","sex","logWBC","Rx")
names(remission)=var_name
remission
##    surv relapse sex logWBC Rx
## 1    35       0   1   1.45  0
## 2    34       0   1   1.47  0
## 3    32       0   1   2.20  0
## 4    32       0   1   2.53  0
## 5    25       0   1   1.78  0
## 6    23       1   1   2.57  0
## 7    22       1   1   2.32  0
## 8    20       0   1   2.01  0
## 9    19       0   0   2.05  0
## 10   17       0   0   2.16  0
## 11   16       1   1   3.60  0
## 12   13       1   0   2.88  0
## 13   11       0   0   2.60  0
## 14   10       0   0   2.70  0
## 15   10       1   0   2.96  0
## 16    9       0   0   2.80  0
## 17    7       1   0   4.43  0
## 18    6       0   0   3.20  0
## 19    6       1   0   2.31  0
## 20    6       1   1   4.06  0
## 21    6       1   0   3.28  0
## 22   23       1   1   1.97  1
## 23   22       1   0   2.73  1
## 24   17       1   0   2.95  1
## 25   15       1   0   2.30  1
## 26   12       1   0   1.50  1
## 27   12       1   0   3.06  1
## 28   11       1   0   3.49  1
## 29   11       1   0   2.12  1
## 30    8       1   0   3.52  1
## 31    8       1   0   3.05  1
## 32    8       1   0   2.32  1
## 33    8       1   1   3.26  1
## 34    5       1   1   3.49  1
## 35    5       1   0   3.97  1
## 36    4       1   1   4.36  1
## 37    4       1   1   2.42  1
## 38    3       1   1   4.01  1
## 39    2       1   1   4.91  1
## 40    2       1   1   4.48  1
## 41    1       1   1   2.80  1
## 42    1       1   1   5.00  1
response=Surv(remission$surv,remission$relapse)
cox_1=coxph(response~sex+logWBC+Rx,data=remission)
cox_1
## Call:
## coxph(formula = response ~ sex + logWBC + Rx, data = remission)
## 
## 
##         coef exp(coef) se(coef)     z       p
## sex    0.315      1.37    0.455 0.692 4.9e-01
## logWBC 1.682      5.38    0.337 4.997 5.8e-07
## Rx     1.504      4.50    0.462 3.258 1.1e-03
## 
## Likelihood ratio test=47.2  on 3 df, p=3.17e-10  n= 42, number of events= 30
residual=cox.zph(cox_1)
residual
##            rho chisq      p
## sex    -0.3684 4.076 0.0435
## logWBC  0.0595 0.161 0.6883
## Rx     -0.1017 0.344 0.5578
## GLOBAL      NA 4.232 0.2374

Based on the description of the data, sex, logWBC and Rx seems to be time-independent covariates. The printout of the residual shows that sex covariate violates the PH assumption for the model being fit. The p-value from the residual output shows significant on sex variable, which indicates violation of Cox proportional hazard model.

General extended Coxmodel

To use extended Cox model to assess the PH assumption, it is important to model the general form of extended Cox model.

\[h(t,X) = h_o (t)* exp(\beta_1 *sex + \beta_2 *logWBC + \beta_3 * Rx + \delta_1 *sex*g(t) + \delta_2 *logWBC*g(t)+ \delta_3 *Rx*g(t)\]

The Involvement of Heaviside Function:

Suppose that we want to assess the PH assumption for sex variable using heaviside function to yield a constant hazard ratio for less than 15 weeks, and constant hazard ratio for 15 weeks or more follow up.

There are two possible ways to model the involvement of heaviside function depending on whether or not the experimenter using one heavisude function or two heaviside function.

one-heaviside function: \[h(t,X) = h_o (t) * exp(\beta_1 *sex + \beta_2 * logWBC + \beta_3 * Rx + \delta_1 * sex*g(t))\]

where g(t) = 1 if \(0 \le t <15\)

  g(t) = 0 if $t>15$
  

two-heaviside function

\[h(t,X) = h_o (t) *exp(\beta_1 * sex + \beta_2 * logWBC +\beta_3 * Rx +\delta_1 *sex*g_1(t) + \delta_2 * sex *g_2(t))\]

where g_1(t) = 1 if \(0 \le t <15\)

  g_1(t) = 0 if $t > 15$
  
  g_2(t) = 0 if $0 \le t <15$
  
  g_2(t) = 1 if $t > 15$

If we run the two-heaviside model:

remission.cp=survSplit(remission,cut=15,event="relapse",start="start",end="surv",id="id")
remission.cp$hv1=remission.cp$sex*(remission.cp$start<15)
remission.cp$hv2=remission.cp$sex*(remission.cp$start>=15)
new_response=Surv(remission.cp$start,remission.cp$surv,remission.cp$relapse)
ext_cox=coxph(new_response~logWBC+Rx+hv1+hv2+cluster(id),data=remission.cp,method="breslow")
ext_cox
## Call:
## coxph(formula = new_response ~ logWBC + Rx + hv1 + hv2 + cluster(id), 
##     data = remission.cp, method = "breslow")
## 
## 
##         coef exp(coef) se(coef) robust se     z       p
## logWBC 1.593      4.92    0.335     0.330 4.824 1.4e-06
## Rx     1.390      4.01    0.464     0.367 3.783 1.5e-04
## hv1    0.266      1.30    0.479     0.396 0.671 5.0e-01
## hv2    0.248      1.28    1.080     0.739 0.336 7.4e-01
## 
## Likelihood ratio test=43.8  on 4 df, p=7.22e-09  n= 56, number of events= 30

If we want to test the null hypothesis that there is no treatment effect vs there is treatment effect adjusted for logWBC and time-dependent sex variable:

Ho: Rx=0 Ha: Rx!=0

From the ext_cox output the Hazard ratio for treatment effect was estimated at 4.01, which is a little bit different from STATA output from the textbook. The 95% Confidence interval can be easily calculated from the output:

\[\mbox {95% CI for HR} = exp[\beta_i\pm 1.96\sqrt(var(\beta_i))]\]

95% CI was calculated at [1.5246,9.968]

From the output it is evident that there is a significant treatment effect for the remission of survival times on 42 leukemia patients.

Comparing the result above with Stratified Cox Procedure

response=Surv(remission$surv,remission$relapse)
remission_cox_strata=coxph(response~logWBC+Rx+strata(sex),data=remission)
remission_cox_strata
## Call:
## coxph(formula = response ~ logWBC + Rx + strata(sex), data = remission)
## 
## 
##         coef exp(coef) se(coef)    z       p
## logWBC 1.454      4.28    0.344 4.22 2.4e-05
## Rx     0.998      2.71    0.474 2.11 3.5e-02
## 
## Likelihood ratio test=32.1  on 2 df, p=1.09e-07  n= 42, number of events= 30

From the Stratified Cox Procedure, the Rx is barely significant at 5% level. This is different from the extended cox-procedure which indicates that Rx is signicant at 1% level. To choose which model is more appropriate would be to compare goodness of fit statistics or to plot adjusted survival curves for each model to see which one fits better to data.

Reference:

  1. Singh R, Mukhopadhay K. 2011. Survival analysis in clinical trials: Basic and must know areas.:Perpect Clin Res.2(4):145-148

  2. Kleinbaum DG, Klein M. 2012. Survival Analysis: A Self-Learning Text, Third Edition. New York, NY.:Springer.700 p.

  3. Cox D, Oakes D. 1984. Analysis of Survival Data. London, UK.:Chapman and Hall Ltd. 201 p.