Abstract

This article presents an application of the Kaplan-Meier estimator and real data based on the choice of treatments for prostate cancer patients, which contains several time dependent covariates. The observations is measured repeatedly and fit using the Cox propotional hazard models as the Cox PH model is the most popular method for survival data. A detailed example using data from a clinical trial of estrogen treatment for prostatic cancer is presented. In this study, significant treatment-covariate interactions were detected. While conducting the study, we find that some covariates that were derived from the Cox PH model are appropriate to the data. We further discovered the patterns in survival based on the choice of treatments for cancer patients. The covariates that would be strongly related to our study are treatment, age in years, weight index, history of cardiovascular disease, size of the primary tumor.

Introduction

Recent advances in treatment have extended the expected survival of cancer patients up to and even beyond ten years after diagnosis. This leads to a non-negligible risk of death due to other causes than the cancer of interest, more particularly for older patients. In this context, it is crucial to account for the competing causes of death and discover the pattern of the survival for cancer patients. We consider methods for the analysis of data when the response of interest is the time until some event occurs, such events are generically reffered to as a failure. Most of the statistical research was concentrated on parametric models. While the survival analysis attempts to cover both the parametric and nonparametric methods, the emphasis is on the more recent nonparametric developments with applications to medical research.

Survival analysis examines and models the time it takes for events to occur. The prototypical such event is death, from which the name “survival analysis” terminology derives. Survival analysis focuses on he distribution of survival times. Survival times are data that measure follow-up time from a defined starting point to the occurence of a given event. For example, the time from the start to the end of a remission period or the time from the diagnosis of a disease to death. Standard statistical techniques cannot usually be applied because the underlying distribution is rarely Normal and the data are often ‘censored’. Survival time is described as censored when there is a follow-up time, but the event has not yet occurred or is not known to have happened. For example, if remission time is being studied and the patient is still in remission at the end of the study, then that patient’s remission time would be censored. If a patient for some reason drops out of a study before the end of the study period, then that patient’s follow-up time would also be considered to be censored.

I will study the covariates linear relationship for each convariates in relation to the martingale residuals model. Further analysis on this data will be done throughout the study which will provide a summarized account of the techniques for analysis such as the Kaplan-Meier estimator, AIC and Cox’s PH model for time dependent covariates. Checking the PH assumption and applying the methodologies, explanations of the suitable Cox’s PH model obtained by the criteria AIC and based on the residuals explanation will be provided in the Discussion Section. Following by concluding remarks and more details on the analysis will be shown in appendix.

Data Characteristics

Prostate cancer is commonly a leading cause of death. This data were obtained from a randomized trial comparing four treatments for patients with prostatic cancer in stages 3 and 4 with almost equal numbers of patients on placebo and each of three doses of estrogen. The data for 502 patients consist of an identification number, stage of tumor, the total months of follow-up since randomization, an indicator for the survival status or cause of death.

The following table shows the variables available and their definitions.

Item Variable Definition
1 patno Patient number
2 stage Stage
3 rx Treatment
4 dtime Months of follow-up
5 status Follow-up status
6 age Age in years
7 wt Weight index = wt(kg) - ht(cm) + 200
8 pf Performance rating
9 hx History of cardiovascular disease
10 sbp Systolic blood pressure /10
11 dbp Diastolic blood pressure /10
12 ekg Electrocardiogram code
13 hg Serum Hemoglobin (gr / 100 ml)
14 sz Size of primary tumor (cm squared)
15 sg Combined index of stage and hist. grade
16 ap Serum prostatic acid phosphatase
17 bm Bone metastases
18 sdate Date on study (as days since January 1, 1960)

Four patients had missing values on all of the following variables: wt, pf, hx, sbp, dpb, ekg, hg, bm; two of these patients were also missing sz. These patients are excluded from consideration.

Recursive Partitioning

The standard approach to implement multiple imputations for the missing values is by using a recursive partitioning method. Recursive partitioning is an imputation method which preserves the interactions in the data automatically. Using simulated method, recursive partitioning creates appropriate variability between imputations and unbiased parameter estimates with appropriate confidence intervals. Besides, it is shown that the potential of recursive partitioning imputation approaches depends on the relevance of a possible interaction effect, the correlation structure of the data, and the type of possible interaction effect present in the data.

The data set provided for a few patients had some missing values for the variables Combined index of stage and grade, Size of the primary tumor, Age and Weight Index. Recursive partitioning method were used to generate the likelihood of the missing values based on the patient’s existing data for other variables.

To provide a better understanding, for an example, finding the missing value of the combined index of stage and historical grade for a patient with the value of 0.7 of serum prostatic acid phosphatase,(ap), size of primary tumor, sz of 42cm, and have history of bone metastases, bm which is equal to 1, the value that will be assigned for the respective patient based on the recursive partitioning method is sg = 11.

The simple way to understand is by writing down the rules (from the left) from the classification tree plot provided below. For the Combined index of stage and grade, there are five rules:

  1. First Rule: If the serum prostatic acid phosphatase < 1.05, and size of primary tumor < 25.5, and bone metastases < 0.5, then the missing value that we predict for the respective patient is 8.985 ~ 9.

  2. Second Rule: If the serum prostatic acid phosphatase < 1.05, and size of primary tumor < 25.5, and bone metastases >= 0.5, then the missing value that we predict for the respective patient is 11.

  3. Third Rule: If the serum prostatic acid phosphatase < 1.05, and size of primary tumor >= 25.5, then the missing value that we predict for the respective patient is 10.541 ~ 11.

  4. Fourth Rule: If the serum prostatic acid phosphatase >= 1.05, and size of primary tumor < 25.5, then the missing value that we predict for the respective patient is 11.83 ~ 12.

  5. Fifth Rule: If the serum prostatic acid phosphatase >= 1.05, and size of primary tumor >= 25.5, then the missing value that we predict for the respective patient is 12.58 ~ 13.

Combined index of stage and hist. grade Imputation

## n=491 (11 observations deleted due to missingness)
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 491 1996.94500 10.309570  
##   2) ap< 1.049927 295  587.51190  9.159322  
##     4) sz< 25.5 271  491.63100  9.036900  
##       8) bm< 0.5 264  439.93940  8.984848 *
##       9) bm>=0.5 7   24.00000 11.000000 *
##     5) sz>=25.5 24   45.95833 10.541670 *
##   3) ap>=1.049927 196  431.67350 12.040820  
##     6) sz< 25.5 141  315.91490 11.829790 *
##     7) sz>=25.5 55   93.38182 12.581820 *

The best way to make sure that we have successfully imputed the missing values is by observing the comparison between the original and imputed number of observations. Based on the summary table for Combined index of stage and grade, we can observe that there are no missing value in the imputed sg in comparison to original sg.

##             Min. 1st Qu. Median     Mean 3rd Qu. Max. NA's
## original sg    5       9     10 10.30957   11.00   15   11
## imputed  sg    5       9     10 10.31673   11.75   15    0

Same steps were repeated throughout the study in order to fill in the missing values for the respective variables. The tables below provides the new values that were imputed by using the recursive partitioning method for the missing values in the variables, Combined index of stage and grade, Size of the primary tumor, Age and Weight Index.

Size of primary tumor (cm squared) Imputation

Weight index Imputation

Age in years Imputation

Kaplan Meier (KM) Survival curves

In analyzing survival data, two functions that are dependent on time are of particular interest: the survival function and the hazard function. The survical function S(t), is defined as the probability of surviving at least to time t. The hazard function h(t) is the conditional probability of dying at time t having survived to that time.

The Kaplan-Meier estimator is a non-parametic approach of estimating survival probabilities and is one of the most frequently used method of survival analysis. KM survival curve plots estimate survival probabilities with respect to survival time. The nature of the curve is that it is downward slopping. The graph of S(t) against t is called the survival curve.

KM estimator involves computing of probabilities of occurence of event at a certain point of time. For each time interval, survival probability is calculated as the number of subjects surviving divided by the number of patients at risk, subjects who are lost are considered “censored” and are not counted in the denominator.Total probability of survival till that time interval is calculated by multiplying all the probabilities of survival at all time intervals preceding that time (by applying law of multiplication of probability to calculate cumulative probability). For example, the probability of a patient surviving two days after a kidney transplant can be considered to be probability of surviving the one day multiplied by the probability surviving the second day given that patient survived the first day. This second probability is called as a conditional probability. Although the probability calculated at any given interval is not very accurate because of the small number of events, the overall probability of surviving to each point is more accurate.

Let us look at the Overall Kaplan Meier curve of a group of patients with pancreatic cancer that were going through treatments. The time in months for the survival probability of survival at the end of a particular time is 0.50 is called as median survival time. Based on the graph, the median survival time is 34 months meaning that there is 50% chances that the patient with pancreatic cancer will be able to survive.

## Call: survfit(formula = Surv(dtime, dead) ~ 1, data = pc)
## 
##       n  events  median 0.95LCL 0.95UCL 
##     502     354      34      30      39

Goodness of Fit Using Residuals

The use of residuals for model checking has been widely used in regression analysis. An important tool for assessing the goodness of fit of a model is to compare censoring indicator (0 for censored, 1 for death) for each subject to the expected value of that indicator under the proportional hazards Cox model.

A null model is one with no fitted covariates. There is still a partial likelihood, and the model produces martingale residuals. We can plot against continuous predictors to get a preliminary assessment of which predictors should be in the model and what form they should take, meaning either continuous or categorical variable.

Cox Proportional Hazards Model with no predictors

Kaplan-Meier estimator cannot cope with complicated model with several factors or continuous covariates. However Cox Proportional Hazard Model can be used to model complicated structure. The Cox proportional hazards model is essentially a regression model which is commonly used statistically in medical research for investigating the association between the survival time of patients and one or more predictor variables.Therefore, it is important to assess whether a fitted Cox regression model can adequately describes the data.

Residuals are often used to investigate the lack of fit of a model. For Cox regression, there is no easy analog to the usual “observed minus predicted” residual of linear regression. Instead, several specialized residuals have been proposed for Cox regression analysis. The function calculates residuals that are well defined for an intercept only Cox regression model: the martingale and deviance residuals.

The martingale residual of a subject (patient) specifies excess failures beyond the expected baseline hazard. Since martingale residuals are not symmetrically distributed, even when the fitted model is correct, it is often advantageous to transform them into more symmetrically distributed residuals: deviance residuals. Thus, deviance residuals are defined as transformations of the martingale residual and the event variable. Deviance residuals are often symmetrically distributed around zero Deviance Residuals are similar to residuals from ordinary linear regression in that they are symmetrically distributed around 0 and have standard deviation of 1.0. In our study, a subject with a large deviance residual is poorly predicted by the model, i.e. is different from the baseline cumulative hazard. A negative value indicates a longer than expected survival time.

We will first use the null model to test against each predictor variables in order to determine whether or not the predictor variables is siginificant in predicting the model.

## Call:  coxph(formula = Surv(dtime, dead) ~ 1, data = pc)
## 
## Null model
##   log likelihood= -2010.754 
##   n= 502

Age

Based on the plot below, the variable age has a non-linear relationship against the Martingale Residual for the null model. We can see that at the age of 70 an above, the line gradually move upwards and majority of the observations were clustered between the age of 70 and the age of 80. Thus, the variable age provides adequate information that it might be a useful siginificant predictor.

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Serum Hemoglobin

Based on the plot, the variable hg which indicates Serum Hemoglobin, has a non-linear relationship against the Martingale Residual for the null model. We can see that the observations were mostly clustered around 12.5 to 15 grams of serum hemoglobin. Thus, the variable hg provides adequate information that it might be a useful siginificant predictor.

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Combined index of stage and hist. grade

Based on the boxplot, we can observed that the median distribution across the stages and historical grades are spread out which gives us an indication that there is variability across this variable. This proves that this variable will be significant covariate for the survival analysis.

History of Cardiovascular Disease

Based on the boxplot, we can see two outliers exist where the respective patients has history of cardiovascular disease and the median value for the cancer patients who have had history of cardiovascular disease is higher which shows that there is a higher chance that the probability in surviving is lower. This predictor proves that it is significant.

Weight Index

Based on the plot, the weight index has a non-linear relationship against the martingale residuals which proves that it is a significant variable.

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Bone Metastases

Based on the boxplot, we can see outliers exist where the respective patients has history of bone metastases condition and the median value for the cancer patients who have had bone metastases disease is higher which shows that there is a higher chance that the probability in surviving is lower. This predictor proves to be significant.

Size of Primary Tumor

Based on the plot, note that the graph for sz against the null model martingale residuals constructed a non-linear relationship which proves that this variable is significant.

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Cox Proportional Hazards Model

The purpose of the model is to evaluate simultaneously the effect of several factors on survival. In other words, it allow us to examine how specified factors influence the rate of a particular event happening and in our case, we are interested in studying the survival pattern based on the choice of treatment for cancer patients at a particular point in time. This rate is commonly referred as the hazard rate. Predictor variables or factors are usually termed covariates in the survival analysis.

The Cox model is expressed by the hazard function denoted by h(t). Briefly, the hazard funciton can be interpreted as the risk of dying at time t. It can be estimated as follow:

\[ \text h(t) = h_0 (t) * exp(\beta_1 x_1 +\beta_2 x_2 +...+\beta_p x_p) \]

where,

  1. t represents the survival time

  2. h(t) is the hazard function determined by a set of p covariates \[(x_1,x_2,...,x_p)\]

  3. the coefficients \[(b_1,b_2,...,b_p)\] measure the impact of covariates.

  4. the term h0 is called the baseline hazard. It corresponds to the value of the hazard if all the x_i are equal to zero. The ‘t’ in h(t) reminds us that the hazard may vary over time.

The quantities \[\text exp(\beta_i)\] are called hazard ratios (HR). A value of \[ \beta_i \] greater than zero, or equivalently a hazard ratio greater than one, indicates that as the value of the covariate increases, the event hazard increases and thus the length of survival decreases.

In a simpler way to understand, a hazard ratio above 1 indicates a covariate that is positively associated with the event probability, and thus negatively associated with the length of survival.

In summary, - HR = 1: No effect - HR < 1: Reduction in Hazard - HR > 1: Increase in Hazard

For this study, we will fit the Cox regression using the following covariates: age, hg, sg, hx, wt, bm and sz.

We start by computing univariate Cox analyses for all these variables; then we will fit multivariate cox analyses using two variables to describe how the factors jointly impact on survival.

## Call:
## coxph(formula = Surv(dtime, dead) ~ rx, data = pc)
## 
##   n= 502, number of events= 354 
## 
##                        coef exp(coef)  se(coef)      z Pr(>|z|)  
## rx0.2 mg estrogen  0.032124  1.032645  0.145119  0.221    0.825  
## rx1.0 mg estrogen -0.385517  0.680099  0.156960 -2.456    0.014 *
## rx5.0 mg estrogen  0.002912  1.002916  0.145937  0.020    0.984  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                   exp(coef) exp(-coef) lower .95 upper .95
## rx0.2 mg estrogen    1.0326     0.9684    0.7770    1.3724
## rx1.0 mg estrogen    0.6801     1.4704    0.5000    0.9251
## rx5.0 mg estrogen    1.0029     0.9971    0.7534    1.3350
## 
## Concordance= 0.531  (se = 0.016 )
## Likelihood ratio test= 9.71  on 3 df,   p=0.02
## Wald test            = 9.01  on 3 df,   p=0.03
## Score (logrank) test = 9.13  on 3 df,   p=0.03

The Cox regression results can be interpreted as follow:

  1. Statiscally significance. The column marked “z” gives the Wald statistic value. It corresponds to the ratio of each regression coefficent to its standard error (z = coef/se(coef)). The wald statistic evaluates, whether the beta coefficient of a given variable is statistically siginificantly different from 0. From the output above, we can conclude that the variable rx, which denoted as treatment have highly statiscally significant coefficients.

  2. The regression coefficients. The second feature to note in the Cox model results is the sign of the regression coefficients (coef). A positive sign means that the hazard (risk of death) is higher, and thus the prognosis worse for patients with higher values of that variable. The variable rx is encoded as a numeric vector . The R summary for the Cox model gives the hazard ratio (HR) for the three groups: “0.2 mg estrogen, 1.0 mg estrogen and 5.0 mg estrogen” relative to base group which is “placebo”. The beta coefficients for the most significant group is “1.0 mg estrogen” where the coefficient is -0.386 which indicates that patient who undergo treatment by using “1.0 mg estrogen” have lower risk of death compared to the other groups.

  3. Hazard ratios. The exponentiated coefficients for “1.0 mg os estrogen” (exp(coef) = exp(-0.386) = 0.6798), also known as hazard ratios, give the effect size of covariates. For example, the treatment “1.0 mg estrogen” reduces the hazard by a factor of 0.6798.

The next step that we are interested in analyzing is to describe how the predictor variables jointly impact on survival. We can test the factors by performing a multivariate Cox regression analysis.

The output above shows the regression beta coefficients, the effect sizes (given as hazard ratios) and statistical significance for each of the variables in relation to overall survival. Each factor is assessed through separate univariate Cox regression.Refer to the Appendix Section for the Martingale Residual plots for each univariate Cox regressions.

Multivariate Cox Regression Analysis

A Cox regression of time to death on the time-constant covariates is specified as follow:

## Call:
## coxph(formula = Surv(dtime, dead) ~ rx + pspline(i.age) + pspline(i.wt) + 
##     hg + i.sgf + hx + bm + i.sz, data = pc)
## 
##                            coef se(coef)      se2    Chisq   DF       p
## rx0.2 mg estrogen       0.00756  0.14696  0.14676  0.00265 1.00 0.95896
## rx1.0 mg estrogen      -0.47850  0.16251  0.16228  8.67019 1.00 0.00323
## rx5.0 mg estrogen      -0.06107  0.15451  0.15373  0.15619 1.00 0.69269
## pspline(i.age), linear  0.02031  0.00784  0.00783  6.70702 1.00 0.00960
## pspline(i.age), nonlin                             7.21538 3.07 0.06877
## pspline(i.wt), linear  -0.00980  0.00416  0.00416  5.54302 1.00 0.01855
## pspline(i.wt), nonlin                              5.49279 3.07 0.14541
## hg                     -0.05627  0.03163  0.03157  3.16438 1.00 0.07526
## i.sgf7                 -1.23314  0.68638  0.68526  3.22778 1.00 0.07240
## i.sgf8                 -0.31537  0.37265  0.37210  0.71621 1.00 0.39739
## i.sgf10                -0.61123  0.43766  0.43713  1.95046 1.00 0.16254
## i.sgf11                -0.01915  0.37417  0.37356  0.00262 1.00 0.95917
## i.sgf14                -0.36630  0.62564  0.62469  0.34280 1.00 0.55822
## i.sgf15                -0.12354  0.48279  0.47836  0.06548 1.00 0.79804
## hx                      0.48372  0.11344  0.11316 18.18219 1.00   2e-05
## bm                      0.22829  0.16293  0.16239  1.96332 1.00 0.16116
## i.sz                    0.01596  0.00468  0.00466 11.61815 1.00 0.00065
## 
## Iterations: 5 outer, 14 Newton-Raphson
##      Theta= 0.877 
##      Theta= 0.879 
## Degrees of freedom for terms= 3.0 4.1 4.1 1.0 6.0 1.0 1.0 1.0 
## Likelihood ratio test=109  on 21.1 df, p=6e-14
## n= 502, number of events= 354

From the output above, the variables rx, age, wt, sg, hx and i.sz have highly statistically significant coefficients, while the p-values for sg, hg and bm is greater than 0.05 which indicates that these two variables is not significant in predicting the survival function. The output below shows the Cox proportional hazard model that is the most ideal in discovering the patterns in survival analysis for the cancer patients.

From the output below, the p-value for this model is significant with the value of p = 7e-14, which indicates that the model is significant. This test evaluate the omnibus null hypothesis that all of the betas are 0. In the below example, the test statistics are in close agreement and the null hypothesis is soundly rejected.

In the multivariate Cox analysis, the covariates rx, age, wt, hx and i.sz remain significant (p < 0.05). However, the covariate rx 0.2 mf estrogen and rx 5.0 mg estrogen, fails to be siginificant where respective p-value is equal to 0.8579 and 0.5253, which is greater than 0.05.

  1. The p-value for rx1.0 mg estrogen is 0.0064, with a hazard ratio HR = exp(-0.422) = 0.6557, indicating a strong relationship between using the treatment of 1.o mg estrogen and decreased risk of death. The hazard ratios of covariates are interpretable as multiplicative effects on the hazaes For example, holding other covariates constant, getting treatment of 1.0 mf estrogen reduces the hazard by a factor of 0.6557. We can conclude that getting treatment with 1.0 mg estrogen is associated with good prognostic.

  2. The p-value for age is 0.004, with a hazard ratio HR = 1.2486, indicating a strong relationship between age valye and increased risk of death. Holding other covariates constant, a higher value of age is associated with a poor survival.

  3. The p-value for wt is 0.0013, with a hazard ratio HR = 0.9872, indicating a strong relationship between using weight index and decreased risk of death. Holding other covariates constant, a higher value of wt is associated with a better survival.

  4. The p-value for hx is 3.1e-06, with a hazard ratio HR = 1.6861, indicating a strong relationship between using History of cardiovascular disease and increased risk of death. Holding other covariates constant, a higher value of hx is associated with a poor survival.

  5. The p-value for sz is 6.0-05, with a hazard ratio HR = 1.018, indicating a strong relationship between using Size of primary tumor (cm squared) and increased risk of death. Holding other covariates constant, a higher value of sz is associated with a poor survival.

## Call:
## coxph(formula = Surv(dtime, dead) ~ rx + pspline(i.age) + pspline(i.wt) + 
##     hx + i.sz, data = pc)
## 
##                             coef  se(coef)       se2     Chisq   DF
## rx0.2 mg estrogen      -6.33e-04  1.46e-01  1.46e-01  1.88e-05 1.00
## rx1.0 mg estrogen      -4.06e-01  1.59e-01  1.59e-01  6.53e+00 1.00
## rx5.0 mg estrogen      -8.36e-02  1.51e-01  1.50e-01  3.08e-01 1.00
## pspline(i.age), linear  2.24e-02  7.74e-03  7.73e-03  8.37e+00 1.00
## pspline(i.age), nonlin                                1.03e+01 3.07
## pspline(i.wt), linear  -1.35e-02  3.99e-03  3.98e-03  1.15e+01 1.00
## pspline(i.wt), nonlin                                 7.11e+00 3.07
## hx                      4.76e-01  1.11e-01  1.11e-01  1.84e+01 1.00
## i.sz                    2.19e-02  4.20e-03  4.18e-03  2.72e+01 1.00
##                              p
## rx0.2 mg estrogen      0.99654
## rx1.0 mg estrogen      0.01060
## rx5.0 mg estrogen      0.57867
## pspline(i.age), linear 0.00382
## pspline(i.age), nonlin 0.01711
## pspline(i.wt), linear  0.00069
## pspline(i.wt), nonlin  0.07246
## hx                     1.8e-05
## i.sz                   1.8e-07
## 
## Iterations: 5 outer, 14 Newton-Raphson
##      Theta= 0.882 
##      Theta= 0.883 
## Degrees of freedom for terms= 3.0 4.1 4.1 1.0 1.0 
## Likelihood ratio test=85.2  on 13.1 df, p=1e-12
## n= 502, number of events= 354

Discussion and Comparison Between Non-parametric estimation of survival & Cox Proportional Hazard Model

In our study, the non-parametric estimation of survival method that we used was the Kaplan Meier Survival Estimates. This method estimates the survival rates and hazard from data that may be incomplete. KM method is normally used as a descriptive statistic by estimating the survival curve from the observed survival times without the assumption of an underlying probability distribution.

The main difference between KM Survival Analysis when compared to the Cox proportional Hazard Moddel is that the Kaplan Meier estimator is for estimating a homogeneous cumulative survival or cumulative incidence function in the absence of competing events. The KM Survival curve is a good way to visualize the estimation survival rate of a cancer patient at a certain given time.

Based on the dataset, we can compare curves for different groups of patients. For example, compare the survival pattern for patients who receive different treatments such as (placebo, 0.2 mg of estrogen, 1.0 mg of estrogen, 5.0 mg of estrogen). We can look for gaps in these curves in a horizontal or vertical direction. A vertical gap means that at a specific time point, one group had a greater fraction of subjects surviving. A horizontal gap means that it took longer for one group to experience a fraction of deaths.

The graph below shows the KM survival curve of the prostate cancer patients that received treatments. Patients who were treated by 1.0 mg of estrogen during their treatment session shows that the chances of surviving is higher when compared to other treatments. Furthermore, we can estimate the probability of survival at the of the median survival time for patients who received treatments to treat the cancer as display in the table below the plot. The survival curve is drawn as a step function meaning that the proportion surviving remains unchanged between the veents, even if there are some intermediate censored observations.

## Call: survfit(formula = Surv(dtime, dead) ~ rx, data = pc)
## 
##                      n events median 0.95LCL 0.95UCL
## rx=placebo         127     95   33.0      27      40
## rx=0.2 mg estrogen 124     95   30.0      26      41
## rx=1.0 mg estrogen 126     71   41.5      31      NA
## rx=5.0 mg estrogen 125     93   35.0      28      46

The survival curves can be compared statistically by testing the null hypothesis. This null hypothesis is statistically tested by another test known as Cox proportion hazard test.

In order for the distribution to be homogeneous, all the regression coefficients in the Cox model would have to be zero in the population. When obtaining survival estimates from a Cox model fit, we have to specify the values of all the covariates and check its significance. Cox proportional hazard model enables us to test the effect of other independent variables on survival times of different groups of cancer patients, just like the multiple regression model. Hazard is nothing but the dependent variable and can be defined as probability of drying at a given time. Another difference between Kaplan-Meier method and Cox proportional hazard model is that Cox proportion hazard test assume that the hazard ratio is constant over time.

Nonlinearity which is an incorrectly specified functional form in the parametric part of the model is a potential problem in Cox regression as it is in linear and generalized linear models. The Martingale Residuals is plotted in the Cox Proportional Model BUilding Section in order to detect the nonlinearity and also used to form component-plus-residual (or partial-residual) plots. We are able to filter the covariates that are linear and does not contribute much in giving us an indication that the variables is significant at discovering the survival patter of a cancer patient.

Having fit a Cox model to the data, it is often of interest to examine the estimated distribution of survival times. We can compare the analyze the deviance table where we can see that form the table below, on the Cox proportional model that was selected consist of the variables: rx, age, wt, hx and i.sz where the p-value = 0.001941 computed is significant (p < 0.05).

Another method that we can used to find the covariates that is significant in discovering the cause of death for cancer patients is through the Stepwise Regression Method which is a combination of forward and bakward selections. The initial stage is that the model is built with no predictors, and then sequentially add the most contributive predictors (like forward selecttion). After adding each new variables, any variables that no longer provide an improvement in the model fit will be remove (like bakward selection). The Stepwise regression will recommend the final model that has the least value of AIC and based on our study, the variables that were removed through this method are ap, dbp, sbp, ekg, pf which leaves us with a AIC Value of 3955.53.

Stepwise Regression is very useful for high-dimensional data containing multiple predictor variables. However, for survival analysis, it is not recommended because when there are missing values among covariates in the model, using a stepwise regression procedure may lead to serious problems, or result in coefficients that have the strong sign leading to an incorrect inference about the association among the covariates and outcome.

## Start:  AIC=4017.8
## Surv(dtime, dead) ~ rx
## 
##         Df    AIC
## + i.sgf  6 3997.8
## + i.sz   1 4001.0
## + bm     1 4002.1
## + hg     1 4002.8
## + hx     1 4004.5
## + i.age  1 4007.5
## + i.wt   1 4009.6
## + pf     3 4011.0
## + ekg    7 4015.6
## <none>     4017.8
## + dbp    1 4018.0
## + ap     1 4019.2
## + sbp    1 4019.7
## 
## Step:  AIC=3997.82
## Surv(dtime, dead) ~ rx + i.sgf
## Warning in fitter(X, Y, strats, offset, init, control, weights = weights, :
## Loglik converged before variable 10 ; coefficient may be infinite.
##         Df    AIC
## + hx     1 3981.7
## + hg     1 3988.3
## + i.age  1 3988.3
## + i.sz   1 3990.6
## + i.wt   1 3992.1
## + bm     1 3992.7
## + pf     3 3995.2
## + ekg    7 3997.5
## <none>     3997.8
## + dbp    1 3999.2
## + sbp    1 3999.4
## + ap     1 3999.8
## - i.sgf  6 4017.8
## 
## Step:  AIC=3981.72
## Surv(dtime, dead) ~ rx + i.sgf + hx
## 
##         Df    AIC
## + hg     1 3973.2
## + i.sz   1 3973.2
## + i.wt   1 3973.6
## + bm     1 3975.6
## + i.age  1 3975.7
## <none>     3981.7
## + pf     3 3982.6
## + dbp    1 3982.8
## + ekg    7 3983.4
## + ap     1 3983.7
## + sbp    1 3983.7
## - hx     1 3997.8
## - i.sgf  6 4004.5
## 
## Step:  AIC=3973.17
## Surv(dtime, dead) ~ rx + i.sgf + hx + hg
## 
##         Df    AIC
## + i.sz   1 3964.8
## + i.age  1 3968.7
## + i.wt   1 3969.2
## + bm     1 3970.0
## <none>     3973.2
## + ekg    7 3974.1
## + dbp    1 3974.9
## + ap     1 3975.0
## + sbp    1 3975.1
## + pf     3 3975.6
## - hg     1 3981.7
## - hx     1 3988.3
## - i.sgf  6 3990.6
## 
## Step:  AIC=3964.79
## Surv(dtime, dead) ~ rx + i.sgf + hx + hg + i.sz
## 
##         Df    AIC
## + i.wt   1 3959.7
## + i.age  1 3960.9
## + bm     1 3963.3
## <none>     3964.8
## + ekg    7 3965.4
## + dbp    1 3966.5
## + ap     1 3966.6
## + sbp    1 3966.8
## + pf     3 3967.5
## - i.sgf  6 3971.2
## - i.sz   1 3973.2
## - hg     1 3973.2
## - hx     1 3981.4
## 
## Step:  AIC=3959.72
## Surv(dtime, dead) ~ rx + i.sgf + hx + hg + i.sz + i.wt
## 
##         Df    AIC
## + i.age  1 3956.3
## + bm     1 3959.3
## <none>     3959.7
## + ekg    7 3960.1
## + ap     1 3961.4
## + sbp    1 3961.5
## + dbp    1 3961.7
## + pf     3 3962.4
## - hg     1 3964.0
## - i.wt   1 3964.8
## - i.sgf  6 3965.3
## - i.sz   1 3969.2
## - hx     1 3978.8
## 
## Step:  AIC=3956.29
## Surv(dtime, dead) ~ rx + i.sgf + hx + hg + i.sz + i.wt + i.age
## 
##         Df    AIC
## + bm     1 3955.5
## <none>     3956.3
## + ekg    7 3957.6
## + ap     1 3958.0
## + dbp    1 3958.1
## + sbp    1 3958.2
## + pf     3 3958.4
## - hg     1 3959.5
## - i.age  1 3959.7
## - i.wt   1 3960.9
## - i.sgf  6 3962.2
## - i.sz   1 3965.1
## - hx     1 3971.9
## 
## Step:  AIC=3955.53
## Surv(dtime, dead) ~ rx + i.sgf + hx + hg + i.sz + i.wt + i.age + 
##     bm
## 
##         Df    AIC
## <none>     3955.5
## - bm     1 3956.3
## + ap     1 3956.7
## + dbp    1 3957.3
## + sbp    1 3957.4
## - hg     1 3957.5
## + ekg    7 3957.5
## - i.sgf  6 3957.8
## + pf     3 3958.5
## - i.wt   1 3959.1
## - i.age  1 3959.3
## - i.sz   1 3962.8
## - hx     1 3971.4

Smooth Estimates of continuous Covariates

When a covariate is continuous, it is importat to test whether that covariate is related to survival. Based on the graph below, we are trying to understand how weight index influences the hazard function. The plot shows weight index versus log-hazard. In this case, we see that the smoothing spline is not linear with the increase and decrease relationship with the log-hazard.

all.cph <- coxph(Surv(dtime, dead) ~ rx + pspline(i.age) + pspline(i.wt) + pf + hx + sbp + dbp +
                     ekg + hg + i.sz + i.sg + ap + bm ,
                 data = pc)
res.step
## Call:
## coxph(formula = Surv(dtime, dead) ~ rx + pspline(i.age) + pspline(i.wt) + 
##     hx + hg + i.sz + i.sg, data = pc)
## 
##                            coef se(coef)      se2    Chisq   DF       p
## rx0.2 mg estrogen      -0.00855  0.14661  0.14642  0.00340 1.00  0.9535
## rx1.0 mg estrogen      -0.40885  0.15960  0.15933  6.56216 1.00  0.0104
## rx5.0 mg estrogen      -0.06013  0.15219  0.15138  0.15613 1.00  0.6927
## pspline(i.age), linear  0.01997  0.00778  0.00777  6.58642 1.00  0.0103
## pspline(i.age), nonlin                             7.79533 3.07  0.0533
## pspline(i.wt), linear  -0.01081  0.00410  0.00410  6.93957 1.00  0.0084
## pspline(i.wt), nonlin                              5.39939 3.07  0.1516
## hx                      0.51639  0.11243  0.11223 21.09412 1.00 4.4e-06
## hg                     -0.06375  0.03079  0.03074  4.28717 1.00  0.0384
## i.sz                    0.01780  0.00450  0.00447 15.65636 1.00 7.6e-05
## i.sg                    0.07779  0.02897  0.02882  7.21173 1.00  0.0072
## 
## Iterations: 5 outer, 14 Newton-Raphson
##      Theta= 0.88 
##      Theta= 0.882 
## Degrees of freedom for terms= 3.0 4.1 4.1 1.0 1.0 1.0 1.0 
## Likelihood ratio test=98.2  on 15.1 df, p=3e-14
## n= 502, number of events= 354
termplot(res.step, se = TRUE, terms = 3, ylabs = "Log Hazard")

### Log Cumulative Hazard Plots

It is important to note that if the graph of the hazards cross for two or more categories of a predictor of interest, the Proportional Hazards assumption is not met meaning that there is clear violation and the relationship defers over time. The assumption that the proportional stay constant over time can be inspected by looking at a graph showing the logarithm of the estimated cumulative hazard function. The assumption is equivalent to assume that the difference between the logarithms of the hazards for the two treatments does not change with time, or equally that the difference between the logarithms of the cumulative hazard functions is constant.

However, although the hazard functions do not cross, it is possible that the PH assumption is not met. Thus we can use the graphical approach techniques which involves comparing estimated -ln(-ln) survival curvers over different (combinations of) categories of the variables being investigated, which is in this case rx (treatments). A log-log survival curve is simply a transformation of an estimated survival curve that results from taking the natural log of an estimated survival probability twice.

Based on the log-log plot for treatment below it seems the log-cumulative plot for two groups of patients who took 1.0 mg estrogen and 5.0 mg estrogen, the lines for the two treatments are roughly parallel, suggesting that the proportional hazards assumption is reasonable in this case.Thus, the assumption of proportional hazards is valid meaning that there is no clear violation that the relationship is different over time.

The treatment that was used in this study shows that 1.0 mg estrogen is an effective treatment option. If we were to look at the Kaplan-Meier curve, rx 1.0 mg estrogen p-value and the log cumulative hazard plots. From all of these observations, we can see that the cancer patients who were treated by 1.0 mg estrogen has higher chances or probability of surviving. However, we did face some limitation due to the wide gap of the variability as all of the other treatment options that were used in this study does not seem to be an effective treatment as the patients who undergo those treatments as lower chances of survival rate when compare to 1.0 mg estrogen.

km.1mg <- survfit(Surv(dtime, dead) ~ rx,
                  data = pc, subset = rx == "1.0 mg estrogen")
time.1mg <- km.1mg$time
surv.1mg <- km.1mg$surv
cloglog.1mg <- log(-log(surv.1mg))
logtime.1mg <- log(time.1mg)
km.5mg <- survfit(Surv(dtime, dead) ~ rx,
                  data = pc, subset = rx == "5.0 mg estrogen")
time.5mg <- km.5mg$time
surv.5mg <- km.5mg$surv
cloglog.5mg <- log(-log(surv.5mg))
logtime.5mg <- log(time.5mg)
plot(cloglog.1mg ~ logtime.1mg, type = "s")
lines(cloglog.5mg ~ logtime.5mg, type = "s")

Summary and Conclusion

Survival analysis provides special techniques that are required to compare the risks for death (or some other event) associated with different treatments or groups, where the risk changes over time. In measuring survival time, the start and end points must be clearly defined, and the censored observations noted. Only the most commonly used techniques are introduced in this review. Kaplan-Meier method provides a statistical comparison of two groups, and Cox’s proportional hazards model allows addiditonal covariates to be included.

Both approaches generate estimates of the survival function, which can be used to estimate the probability that a patient survives to a specific time. It is often of interest to assess whether there are statistically significant differences in survival between groups (age), between competing treatment groups (“placebo”, “0.2 mg estrogen”, “1.0 mg estrogen”, “5.0 mg estrogen”) and the weight index of the patient’s group.

Hazard ratios often summarize the associations between risk factors and survival time in a Cox proportional hazards model. The hazard ratio for a dichotomous risk factor (i.e., treatment assignment in discovering the pattern survival) represents the increase or decrease in the hazard in one group as compared to the other.

For example, in a clinical trial with survival time as the outcome, if the hazard ratio is 0.5 comparing patients on an estrogen treatment to those on placebo, this suggests a 50% reduction in the hazard (risk of failure assuming the person survived at a certain point) in the treatment group as compared to the placebo. Another example is that if the hazard ratio is 1.15 comparing cancer patients with no historical information on cardiovascular disease, then the risk of failure or survival is 25% higher than the cancer patients with a history of cardiovascular disease.

To conclude, Kaplan-Meier method is a good method of statistical treatment of survival times which not only makes proper allowances for those observations that are censored but also makes use of the information from these subjects up to the time when they are censored. Cox proportion hazard model enables us to test the effect of other independent variables on survival times of different groups of patients, just like the multiple regression model. Hazard is nothing but the dependent variable and can be defined as the probability of dying at a given time, assuming that the patients have survived up to that given time. The hazard ratio is also a relevant term and defined as the rate of the risk of a hazard occurring at any given time in one group compared with another group at that very time

References

D.P. Byar and S.B. Green, “The choice of treatment for cancer patients based on covariate information: Application to prostate cancer”, Bulletin Cancer, Paris, 67:477-488, 1980.

Moore, Dirk Foster. Applied Survival Analysis Using R. Springer, 2016.

Appendix

## Call:
## coxph(formula = Surv(dtime, dead) ~ rx + pspline(i.age), data = pc)
## 
##                            coef se(coef)      se2    Chisq   DF       p
## rx0.2 mg estrogen       0.05567  0.14565  0.14546  0.14606 1.00 0.70233
## rx1.0 mg estrogen      -0.37917  0.15756  0.15745  5.79134 1.00 0.01611
## rx5.0 mg estrogen       0.01335  0.14874  0.14805  0.00806 1.00 0.92847
## pspline(i.age), linear  0.02867  0.00761  0.00761 14.20122 1.00 0.00016
## pspline(i.age), nonlin                             9.20659 3.08 0.02841
## 
## Iterations: 5 outer, 13 Newton-Raphson
##      Theta= 0.884 
## Degrees of freedom for terms= 3.0 4.1 
## Likelihood ratio test=31.6  on 7.06 df, p=5e-05
## n= 502, number of events= 354
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Call:
## coxph(formula = Surv(dtime, dead) ~ rx + pspline(i.age) + pspline(i.wt), 
##     data = pc)
## 
##                            coef se(coef)      se2    Chisq   DF       p
## rx0.2 mg estrogen       0.01770  0.14619  0.14594  0.01466 1.00 0.90364
## rx1.0 mg estrogen      -0.41010  0.15854  0.15832  6.69107 1.00 0.00969
## rx5.0 mg estrogen      -0.04604  0.15028  0.14952  0.09384 1.00 0.75935
## pspline(i.age), linear  0.02781  0.00764  0.00763 13.26062 1.00 0.00027
## pspline(i.age), nonlin                             8.59316 3.07 0.03741
## pspline(i.wt), linear  -0.01149  0.00395  0.00394  8.45400 1.00 0.00364
## pspline(i.wt), nonlin                              6.82962 3.08 0.08188
## 
## Iterations: 5 outer, 14 Newton-Raphson
##      Theta= 0.882 
##      Theta= 0.884 
## Degrees of freedom for terms= 3.0 4.1 4.1 
## Likelihood ratio test=47  on 11.1 df, p=2e-06
## n= 502, number of events= 354
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'