Practical 3 PART II

Poisson distribution

According to the National Office of Vital Statistics in the US Department of Health, the average number of accidental drownings per year in the United States is 3.0 per 100,000 persons in the population. Use the Poisson distribution to calculate the probability that in a city of 200,000 people there will be:

  1. 2 accidental drownings in a year
  2. between 3 to 6 (inclusive) accidental drownings in a year
  3. fewer than 3 accidental drowings in a year

Solutions: Practical 3 PART II - Poisson distribution

  1. 2 accidental drownings in a year

\[ \begin{aligned} 3 \text{ per 100,000} & \implies\lambda = 6 \text{ per 200,000}\\ &\text{thus we use Poisson distribution:}\\ f\left(x\right) &=Pr(X=x) =\frac{e^{-\lambda}\lambda^x}{x!} =\frac{e^{-6}6^x}{x!}\\ Pr(X=2) & =\frac{e^{-6}6^2}{2!}=0.0446 \end{aligned} \]

  1. between 3 to 6 (inclusive) accidental drownings in a year

\[ \begin{aligned} Pr(3\leq X\leq 6) & =\frac{e^{-6}6^3}{3!}+\frac{e^{-6}6^4}{4!}+\frac{e^{-6}6^5}{5!}+\frac{e^{-6}6^6}{6!}\\ Pr(3\leq X\leq 6) &=0.5443 \end{aligned} \]

  1. fewer than 3 accidental drowings in a year

\[ \begin{aligned} Pr(X\lt 3) & =\frac{e^{-6}6^0}{0!}+\frac{e^{-6}6^1}{1!}+\frac{e^{-6}6^2}{2!}\\ Pr(X\lt 3) &=0.0620 \end{aligned} \]

  1. at least 3 accidental drownings in a year

\[ \begin{aligned} P(X\geq 3) & =1-P(X\lt 3)\\ P(X\geq 3) & =1-\left(\frac{e^{-6}6^0}{0!}+\frac{e^{-6}6^1}{1!}+\frac{e^{-6}6^2}{2!}\right)\\ P(X\geq 3) & =1-0.0620\\ P(X\geq 3) & =0.938 \end{aligned} \]

Practical 5, 7 and 8. Cox PH Regression analysis and Diagnostics using R

Cox PH Regression analysis using “melanom” data

Using the “melanom” data found in the package ISwR, conduct a survival analysis using R software.

Question #1: What is malignant melanoma?

But before conducting the analysis, what is malignant melanoma? From literature, what are some of the key determinants of survival for patients who have undergone an operation in malignant melanoma?

Solution #1: What is malignant melanoma?

Malignant melanoma, also known as melanoma is a type of cancer occurs on the skin. Melanoma develops from melanocytes; melanocytes are pigment-producing cells. In rare cases melanoma may also occur in the eyes, mouth or intestines. In men, melanoma commonly occur in the back while in women they commonly occur on the legs. One of the causes of melanoma is ultraviolet light (UV) exposure that may be triggered by low levels of melanin, a skin pigment. Other predisposing factors include: many moles, genes, compromised immune system among others.

Diagnosis is done via biopsy.

Avoiding UV light and use of sunscreen may prevent melanoma. Treatment is removal of cancerous tissue through surgery. Biologic therapy, radiation therapy, immunotherapy or chemotherapy is use for treating advanced melanoma and may improve survival.

Key determinants of the survival of patients who have undergone surgery in melanoma include: tumor thickness, metastasis rate and whether or not the overlying skin has broken down (commonly referred as ulceration).

Question #2: Description of “melanom” data

Consider the “melanom” data. Which variables do we find in the data? What is the structure (string, factor, numeric, etc) of the variables in the data? Convert ulceration and sex to factor variables and check that the structure of the variables in the data recognizes the two as factors. To check the structure of the variables, use the R code:
str(melanom)
Explain what the data is all about. Explain in general, the role of covariates in the melanoma data in relation to M. melanoma.

Soution: Question #2: Description of “melanom” data

knitr::opts_chunk$set(echo = TRUE)
library(ISwR)
data("melanom")
names(melanom)
## [1] "no"     "status" "days"   "ulc"    "thick"  "sex"
str(melanom)
## 'data.frame':    205 obs. of  6 variables:
##  $ no    : int  789 13 97 16 21 469 685 7 932 944 ...
##  $ status: int  3 3 2 3 1 1 1 1 3 1 ...
##  $ days  : int  10 30 35 99 185 204 210 232 232 279 ...
##  $ ulc   : int  1 2 2 2 1 1 1 1 1 1 ...
##  $ thick : int  676 65 134 290 1208 484 516 1288 322 741 ...
##  $ sex   : int  2 2 2 1 2 2 2 2 1 1 ...
#?melanom() # get detailed description of "melanom" data
melanom$ulc <- as.factor(melanom$ulc) # convert to factor
melanom$sex <- as.factor(melanom$sex) # convert to factor

class(melanom$sex) # check the variable type
## [1] "factor"
class(melanom$ulc) # check the variable type
## [1] "factor"

The melanom data frame has 205 rows and 7 columns. It contains data relating to the survival of patients after an operation for malignant melanoma, collected at Odense University Hospital by K.T. Drzewiecki. The data frame has the following variables:

  • \(\texttt{no}\) - a numeric vector, patient code
  • \(\texttt{status}\) - a numeric vector code, survival status; 1: dead from melanoma, 2: alive, 3: dead from other cause.
  • \(\texttt{days}\) - a numeric vector, observation time.
  • \(\texttt{ulc}\) - a numeric vector code, ulceration; 1: present, 2: absent.
  • \(\texttt{thick}\) - a numeric vector, tumor thickness (1/100 mm).
  • \(\texttt{sex}\) - a numeric vector code; 1: female, 2: male.

Survival following operation may depend on a number of factors for example, tumor thickness. These factors generally become the covariates when fitting a Cox PH model to the data. For example, \(\texttt{thick}\), \(\texttt{sex}\), \(\texttt{ulc}\) in the “melanom” data. However, their effect on the survival should be tested to check if they are significant.

Question #3: Cox PH with covariate “sex”

You wish to test the effect of covariate sex on post operation survival for malignant melanoma. Determine the hazard ratio associated with the covariate “sex.” What is its confidence interval? Based on the associated p-value, do you think that males are more likely to die after an operation from M. melanoma?

Solution: Question #3: Cox PH with covariate “sex”

knitr::opts_chunk$set(echo = TRUE)
library(survival)
## 
## Attaching package: 'survival'
## The following object is masked from 'package:ISwR':
## 
##     lung
(cox.fit <- coxph(Surv(days, status == 1) ~ sex, data = melanom)) # Fit a cox regression model with covariate sex
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex, data = melanom)
## 
##        coef exp(coef) se(coef)     z      p
## sex2 0.6622    1.9390   0.2651 2.498 0.0125
## 
## Likelihood ratio test=6.15  on 1 df, p=0.01314
## n= 205, number of events= 57
summary(cox.fit)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex, data = melanom)
## 
##   n= 205, number of events= 57 
## 
##        coef exp(coef) se(coef)     z Pr(>|z|)  
## sex2 0.6622    1.9390   0.2651 2.498   0.0125 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      exp(coef) exp(-coef) lower .95 upper .95
## sex2     1.939     0.5157     1.153      3.26
## 
## Concordance= 0.59  (se = 0.034 )
## Likelihood ratio test= 6.15  on 1 df,   p=0.01
## Wald test            = 6.24  on 1 df,   p=0.01
## Score (logrank) test = 6.47  on 1 df,   p=0.01

The model fit above is significant (Likelihood ratio test = 6.15 on 1 d.f, p-value = 0.01). The hazard ratio associated with the covariate “\(\texttt{sex}\)” is 1.939, with a confidence interval of (1.153, 3.26) and p-value of 0.0125. Thus the effect of sex on post operation survival for malignant melanoma is significant.

Interpretation: males are \(93.9\%\) more likely to die following operation per unit time as compared to females.

Question #4: Cox PH with covariates “sex” and “thick”

Adjust for Tumour thickness, in the model in 2 above (model the logarithm of thickness instead of thickness in the model). How does the effect of sex change? What is its effect of tumour thickness itself on the post operation survival to M. melanoma?

Solution: Question #4: Cox PH with covariates “sex” and “thick”

knitr::opts_chunk$set(echo = TRUE)
# Fit a cox regression model with sex and thick as covariates
library(survival)
melanom$logthick <- log(melanom$thick)
(cox.fit2 <- coxph(Surv(days, status == 1) ~ sex + logthick, data = melanom))
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + logthick, data = melanom)
## 
##            coef exp(coef) se(coef)     z        p
## sex2     0.4580    1.5809   0.2687 1.705   0.0883
## logthick 0.7809    2.1834   0.1573 4.963 6.94e-07
## 
## Likelihood ratio test=33.45  on 2 df, p=5.453e-08
## n= 205, number of events= 57
summary(cox.fit2)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + logthick, data = melanom)
## 
##   n= 205, number of events= 57 
## 
##            coef exp(coef) se(coef)     z Pr(>|z|)    
## sex2     0.4580    1.5809   0.2687 1.705   0.0883 .  
## logthick 0.7809    2.1834   0.1573 4.963 6.94e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##          exp(coef) exp(-coef) lower .95 upper .95
## sex2         1.581     0.6326    0.9337     2.677
## logthick     2.183     0.4580    1.6040     2.972
## 
## Concordance= 0.749  (se = 0.033 )
## Likelihood ratio test= 33.45  on 2 df,   p=5e-08
## Wald test            = 31  on 2 df,   p=2e-07
## Score (logrank) test = 32.52  on 2 df,   p=9e-08

The model fit above is significant (Likelihood ratio test = 33.45 on 2 d.f, p-value = <0.001). Adjusting for tumor thickness, the hazard ratio associated with the covariate “sex” now changes to 1.581, with a confidence interval of (0.9337, 2.677) and p-value of 0.09. Thus the effect of sex on post operation survival for malignant melanoma is no longer significant after adjusting for tumor thickness.

Interpretation: Adjusting for tumor thickness, males are 58.1% more likely to die per unit time as compared to females.

Adjusting for sex, the hazard ratio associated with the covariate log of “thick” is 2.183, with a confidence interval of (1.604, 2.972) and p-value of <0.001. Thus the effect of tumor thickness on post operation survival for malignant melanoma is significant after adjusting for sex.

Interpretation: Adjusting for sex, a person is 2.183 times likely to die per unit time for every additional unit in log of tumor thickness.

Question #5: Cox PH with covariates “sex” and “thick” with “ulceration” as a stratification covariate

To the model in 3 above, add ulceration as a stratification covariate. Is there any effect modification on the covariates sex and thickness of tumour?

Solution: Question #5: Cox PH with covariates “sex” and “thick” with “ulceration” as a stratification covariate

knitr::opts_chunk$set(echo = TRUE)
# Fit a cox regression model with covariates sex and thick; and ulceration as a stratification covariate
library(survival)
(cox.fit3 <- coxph(Surv(days, status == 1) ~ sex + logthick + strata(ulc), data = melanom))
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + logthick + strata(ulc), 
##     data = melanom)
## 
##            coef exp(coef) se(coef)     z      p
## sex2     0.3600    1.4333   0.2702 1.332 0.1828
## logthick 0.5599    1.7505   0.1784 3.139 0.0017
## 
## Likelihood ratio test=13.3  on 2 df, p=0.001296
## n= 205, number of events= 57
summary(cox.fit3)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + logthick + strata(ulc), 
##     data = melanom)
## 
##   n= 205, number of events= 57 
## 
##            coef exp(coef) se(coef)     z Pr(>|z|)   
## sex2     0.3600    1.4333   0.2702 1.332   0.1828   
## logthick 0.5599    1.7505   0.1784 3.139   0.0017 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##          exp(coef) exp(-coef) lower .95 upper .95
## sex2         1.433     0.6977     0.844     2.434
## logthick     1.750     0.5713     1.234     2.483
## 
## Concordance= 0.673  (se = 0.039 )
## Likelihood ratio test= 13.3  on 2 df,   p=0.001
## Wald test            = 12.88  on 2 df,   p=0.002
## Score (logrank) test = 12.98  on 2 df,   p=0.002

The model above is significant (Likelihood ratio test = 13.3 on 2 d.f, p-value = 0.001). Adding a stratification variable \(\texttt{ulc}\) modifies the effect of \(\texttt{sex}\) and \(\texttt{tumor thickness}\) on survival for malignant melanoma. \(\texttt{sex}\) is no longer significant in predicting survival for malignant melanoma (p-value = 0.18). However, tumor thickness is still significant (p-value = 0.0017) with hazards ratio 1.750 and confidence interval (1.234, 2.483).

Adjusting for sex and stratifying by ulceration, a person is \(75\%\) more likely to die per unit time for every additional unit in log of tumor thickness.

Adjusting for tumor thickness and stratifying by ulceration, a male patient is \(43.3\%\) more likely to die per unit time as compared to a female patient. However, the effect of sex is not statistically significant.

Question #6: Models comparison using LRT statistics

Compare the above models using the Likelihood Ratio statistics.

Solution: Question #6: Models comparison using LRT statistics

knitr::opts_chunk$set(echo = TRUE)
# To test for significance of model fit, the p-value of the likelihood ratio test statistic can be obtained as follows:
#anova(cox.fit, cox.fit2, test="LRT") # compare model 1 with model 2
#anova(cox.fit2, cox.fit3, test="LRT") # compare model 2 with model 3

# Alternative R commands
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(lmtest)
lrtest(cox.fit, cox.fit2)
## Likelihood ratio test
## 
## Model 1: Surv(days, status == 1) ~ sex
## Model 2: Surv(days, status == 1) ~ sex + logthick
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   1 -280.12                         
## 2   2 -266.48  1 27.299  1.743e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(cox.fit2, cox.fit3)
## Likelihood ratio test
## 
## Model 1: Surv(days, status == 1) ~ sex + logthick
## Model 2: Surv(days, status == 1) ~ sex + logthick + strata(ulc)
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   2 -266.48                         
## 2   2 -229.74  0 73.475  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Comparing model with covariates sex and tumor thickness with the one covariate sex only, the one with covariates sex and tumor thickness is a better model (likelihood ratio test = 27.299 on 1 d.f, p-value = <0.001).
  • Comparing model with covariates sex and tumor thickness and ulc as a stratification variable with the one with covariates sex and tumor thickness, the one with covariates sex and tumor thickness and ulc as a stratification variable is a better model (likelihood ratio test = 73.475 on 0 d.f, p-value = <0.001).

Practical 5, 7 and 8. Cox PH Regression Diagnostics using R

Part I: Cox PH Regression Diagnostics “melanom” data

Question #1: PH assumption

Check the proportional hazard assumption for all covariates in the melanoma data.

Solution: Question #1: PH assumption

The proportional hazards (PH) assumption can be checked using statistical tests and graphical diagnostics based on the scaled Schoenfeld residuals. In principle, the Schoenfeld residuals are independent of time. A plot that shows a non-random pattern against time is evidence of violation of the PH assumption. The function cox.zph() [in the survival package] provides a convenient solution to test the proportional hazards assumption for each covariate included in a Cox refression model fit. For each covariate, the function cox.zph() correlates the corresponding set of scaled Schoenfeld residuals with time, to test for independence between residuals and time. Additionally, it performs a global test for the model as a whole. The proportional hazard assumption is supported by a non-significant relationship between residuals and time, and refuted by a significant relationship.

knitr::opts_chunk$set(echo = TRUE)
library(survival)
library(ggplot2)
library(survminer)
## Loading required package: ggpubr
## 
## Attaching package: 'survminer'
## The following object is masked from 'package:survival':
## 
##     myeloma
cox.fit3 <- coxph(Surv(days, status == 1) ~ sex + logthick + strata(ulc), data = melanom)

# test for the proportional-hazards (PH) assumption
(test.ph <- cox.zph(cox.fit3))
##          chisq df     p
## sex      0.754  1 0.385
## logthick 4.469  1 0.035
## GLOBAL   4.755  2 0.093

From the output above, the test is not statistically significant for each of the covariates, and the global test is also not statistically significant. Therefore, we can assume the proportional hazards.

It’s possible to do a graphical diagnostic using the function ggcoxzph() [in the survminer package], which produces, for each covariate, graphs of the scaled Schoenfeld residuals against the transformed time.

knitr::opts_chunk$set(echo = TRUE)
library(survival)
library(ggplot2)
library(survminer)
ggcoxzph(test.ph)

In the figure above, the solid line is a smoothing spline fit to the plot, with the dashed lines representing a +/- 2-standard-error band around the fit. Systematic departures from a horizontal line are indicative of non-proportional hazards, since proportional hazards assumes that estimates \(\beta_1, \beta_2\) do not vary much over time. From the graphical inspection, there is no pattern with time. The assumption of proportional hazards appears to be supported for the covariates sex and tumor thickness.

Question #2: Checking for influencial observations

Are there any influential observations in any of the covariates?

Solution: Question #2: Checking for influencial observations

To test influential observations or outliers, we can visualize either:

  • the deviance residuals or
  • the dfbeta values

The function ggcoxdiagnostics()[in survminer package] provides a convenient solution for checkind influential observations. The simplified format is as follow:

ggcoxdiagnostics(fit, type = , linear.predictions = TRUE)

  • fit: an object of class coxph.object
  • type: the type of residuals to present on Y axis. Allowed values include one of c(“martingale”, “deviance”, “score”, “schoenfeld”, “dfbeta”, “dfbetas”, “scaledsch”, “partial”).
  • linear.predictions: a logical value indicating whether to show linear predictions for observations (TRUE) or just indexed of observations (FALSE) on X axis.

Specifying the argument type = “dfbeta”, plots the estimated changes in the regression coefficients upon deleting each observation in turn; likewise, type=“dfbetas” produces the estimated changes in the coefficients divided by their standard errors.

knitr::opts_chunk$set(echo = TRUE)
ggcoxdiagnostics(cox.fit3, type = "dfbeta",
                 linear.predictions = FALSE, ggtheme = theme_bw())
## `geom_smooth()` using formula = 'y ~ x'

(Index plots of dfbeta for the Cox regression of survival time post melanoma operation on sex and tumor thickness)
The above index plots show that comparing the magnitudes of the largest dfbeta values to the regression coefficients suggests that none of the observations is terribly influential individually, even though some of the dfbeta values for sex and log of tumor thickness are large compared with the others.

It’s also possible to check outliers by visualizing the deviance residuals. The deviance residual is a normalized transform of the martingale residual. These residuals should be roughtly symmetrically distributed about zero with a standard deviation of 1.

  • Positive values correspond to individuals that “died too soon” compared to expected survival times.
  • Negative values correspond to individual that “lived too long”.
  • Very large or small values are outliers, which are poorly predicted by the model.

Example of deviance residuals:

knitr::opts_chunk$set(echo = TRUE)
ggcoxdiagnostics(cox.fit3, type = "deviance",
                 linear.predictions = FALSE, ggtheme = theme_bw())
## `geom_smooth()` using formula = 'y ~ x'

The pattern looks fairly symmetric around 0.

Question #3: Linearity assumption

For the continuous variable “tomour thickness,” check if the linearity assumption holds and if not, which functional form is suitable for modeling thickness (logarithm, square root, etc)?

Solution: Question #3: Linearity assumption

Plotting the Martingale residuals against continuous covariates is a common approach used to detect nonlinearity or, in other words, to assess the functional form of a covariate. For a given continuous covariate, patterns in the plot may suggest that the variable is not properly fit.
Martingale residuals may present any value in the range \((-\infty, +1)\):

  • A value of martinguale residuals near 1 represents individuals that “died too soon”,
  • and large negative values correspond to individuals that “lived too long”.

To assess the functional form of a continuous variable in a Cox proportional hazards model, we use the function ggcoxfunctional() [in the survminer R package].
The function ggcoxfunctional() displays graphs of continuous covariates against martingale residuals of null cox proportional hazards model. This might help to properly choose the functional form of continuous variable in the Cox model. Fitted lines with lowess function should be linear to satisfy the Cox proportional hazards model assumptions.

knitr::opts_chunk$set(echo = TRUE)
ggcoxfunctional(Surv(days, status == 1) ~ thick + log(thick) + sqrt(thick), data = melanom)
## Warning: arguments formula is deprecated; will be removed in the next version;
## please use fit instead.

It appears that, nonlinearity is slightly here. Logarithm appears to be the most suitable functional form for modelling tumor thickness.

Part II: “ovarian” data

Now consider the ovarian data in survival package in R.

Question #1: “ovarian” data description

What is the data about? Explain the covariates and the survival and status variable outcome in the data?

Solution: Question #1: “ovarian” data description

knitr::opts_chunk$set(echo = TRUE)
library(survival) # load the survival package
data("ovarian") # Get ovarian data in survival package
## Warning in data("ovarian"): data set 'ovarian' not found
str(ovarian) # 
## 'data.frame':    26 obs. of  6 variables:
##  $ futime  : num  59 115 156 421 431 448 464 475 477 563 ...
##  $ fustat  : num  1 1 1 0 1 0 1 1 0 1 ...
##  $ age     : num  72.3 74.5 66.5 53.4 50.3 ...
##  $ resid.ds: num  2 2 2 2 2 1 2 2 2 1 ...
##  $ rx      : num  1 1 1 2 1 1 2 2 1 2 ...
##  $ ecog.ps : num  1 1 2 1 1 2 2 2 1 2 ...
#help("ovarian") # Get description of ovarian data

Ovarian Cancer Survival Data
Description
Survival in a randomised trial comparing two treatments for ovarian cancer

  • \(\texttt{futime}\): survival or censoring time
  • \(\texttt{fustat}\): censoring status
  • \(\texttt{age}\): in years
  • \(\texttt{resid.ds}\): residual disease present (1=no,2=yes)
  • \(\texttt{rx}\): treatment group
  • \(\texttt{ecog.ps}\): ECOG performance status (1 is better)

“ovarian” data provides survival status of ovarian cancer patients. Survival may depend on various factors such as age, treatment, and presence of residual disease. These factors generally become the covariates when fitting a Cox PH model to the data. However, their effect on the survival should be tested to check if they are significant.

Question #2: CoxPH model with covariates age and ECOG

For a CoxPH model with age and ECOG performance as covariates, test for proportional hazards assumption in the covariates.

Solution: Question #2: CoxPH model with covariates age and ECOG

Cox regression model with covariates age and ECOG performance

knitr::opts_chunk$set(echo = TRUE)
# Fit a cox regression model with covariates age and ECOG performance 
library(survival)
(cox.model <- coxph(Surv(futime, fustat == 1) ~ age + factor(ecog.ps), data = ovarian))
## Call:
## coxph(formula = Surv(futime, fustat == 1) ~ age + factor(ecog.ps), 
##     data = ovarian)
## 
##                     coef exp(coef) se(coef)     z       p
## age              0.16150   1.17527  0.04992 3.235 0.00122
## factor(ecog.ps)2 0.01866   1.01884  0.59908 0.031 0.97515
## 
## Likelihood ratio test=14.29  on 2 df, p=0.000787
## n= 26, number of events= 12
summary(cox.model)
## Call:
## coxph(formula = Surv(futime, fustat == 1) ~ age + factor(ecog.ps), 
##     data = ovarian)
## 
##   n= 26, number of events= 12 
## 
##                     coef exp(coef) se(coef)     z Pr(>|z|)   
## age              0.16150   1.17527  0.04992 3.235  0.00122 **
## factor(ecog.ps)2 0.01866   1.01884  0.59908 0.031  0.97515   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                  exp(coef) exp(-coef) lower .95 upper .95
## age                  1.175     0.8509    1.0657     1.296
## factor(ecog.ps)2     1.019     0.9815    0.3149     3.296
## 
## Concordance= 0.784  (se = 0.085 )
## Likelihood ratio test= 14.29  on 2 df,   p=8e-04
## Wald test            = 10.54  on 2 df,   p=0.005
## Score (logrank) test = 12.26  on 2 df,   p=0.002

The model above is significant (Likelihood ratio test = 14.29 on 2 d.f, p-value <0.001)

The proportional hazard ratio for covariate age is 1.175 with confidence interval (1.066, 1.296) and p-value = 0.001. Thus age is significant in predicting survival adjusting for ECOG performance. Hence adjusting for ECOG performance, an individual is 17.5% more likely to die per unit time for every additional year. ECOG performance is not significant (p-value = 0.975) adjusting for age.

Proportional hazards
The proportional hazards (PH) assumption can be checked using statistical tests and graphical diagnostics based on the scaled Schoenfeld residuals. In principle, the Schoenfeld residuals are independent of time. A plot that shows a non-random pattern against time is evidence of violation of the PH assumption. The function cox.zph() [in the survival package] provides a convenient solution to test the proportional hazards assumption for each covariate included in a Cox refression model fit. For each covariate, the function cox.zph() correlates the corresponding set of scaled Schoenfeld residuals with time, to test for independence between residuals and time. Additionally, it performs a global test for the model as a whole. The proportional hazard assumption is supported by a non-significant relationship between residuals and time, and refuted by a significant relationship.

knitr::opts_chunk$set(echo = TRUE)
# test for the proportional-hazards (PH) assumption for covariates age and ECOG performance
(test.ph2 <- cox.zph(cox.model))
##                 chisq df    p
## age             0.698  1 0.40
## factor(ecog.ps) 2.371  1 0.12
## GLOBAL          3.633  2 0.16

From the output above, the test is not statistically significant for each of the covariates, and the global test is also not statistically significant. Therefore, we can assume the proportional hazards.

It’s possible to do a graphical diagnostic using the function ggcoxzph() [in the survminer package], which produces, for each covariate, graphs of the scaled Schoenfeld residuals against the transformed time.

knitr::opts_chunk$set(echo = TRUE)
ggcoxzph(test.ph2)

In the figure above, the solid line is a smoothing spline fit to the plot, with the dashed lines representing a +/- 2-standard-error band around the fit. Systematic departures from a horizontal line are indicative of non-proportional hazards, since proportional hazards assumes that estimates \(\beta_1, \beta_2\) do not vary much over time. From the graphical inspection, there is no pattern with time. The assumption of proportional hazards appears to be supported for the covariates age and ecog.

Question #3: Cox PH model with covariates age and ECOG and an interaction term

Construct a similar model with an interaction term between age and ECOG performance. Test for proportional hazards assumption again. What do you observe?

Solution: Question #3: Cox PH model with covariates age and ECOG and an interaction term

Cox regression model with interaction between age ECOG performance

knitr::opts_chunk$set(echo = TRUE)
# Fit a cox regression model with covariates age and ECOG performance and interaction term between the two covariates
library(survival)
(cox.model2 <- coxph(Surv(futime, fustat == 1) ~ age * ecog.ps, data = ovarian))
## Call:
## coxph(formula = Surv(futime, fustat == 1) ~ age * ecog.ps, data = ovarian)
## 
##                  coef exp(coef)  se(coef)      z     p
## age          0.159995  1.173505  0.134564  1.189 0.234
## ecog.ps     -0.042843  0.958061  5.140546 -0.008 0.993
## age:ecog.ps  0.001008  1.001008  0.083669  0.012 0.990
## 
## Likelihood ratio test=14.29  on 3 df, p=0.00253
## n= 26, number of events= 12
summary(cox.model2)
## Call:
## coxph(formula = Surv(futime, fustat == 1) ~ age * ecog.ps, data = ovarian)
## 
##   n= 26, number of events= 12 
## 
##                  coef exp(coef)  se(coef)      z Pr(>|z|)
## age          0.159995  1.173505  0.134564  1.189    0.234
## ecog.ps     -0.042843  0.958061  5.140546 -0.008    0.993
## age:ecog.ps  0.001008  1.001008  0.083669  0.012    0.990
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## age            1.1735     0.8521 9.015e-01     1.528
## ecog.ps        0.9581     1.0438 4.034e-05 22752.743
## age:ecog.ps    1.0010     0.9990 8.496e-01     1.179
## 
## Concordance= 0.784  (se = 0.083 )
## Likelihood ratio test= 14.29  on 3 df,   p=0.003
## Wald test            = 10.55  on 3 df,   p=0.01
## Score (logrank) test = 12.27  on 3 df,   p=0.007

The model is significant after including interaction between age and ECOG performance (Likelihood ratio test = 14.29 p-value = 0.003) though p-value has changed from <0.001 to 0.003. However, after adding interaction term the covariates are no longer significant (p-values > 0.05)

Proportional Hazards

knitr::opts_chunk$set(echo = TRUE)
# test for the proportional-hazards (PH) assumption
(test.ph3 <- cox.zph(cox.model2))
##             chisq df     p
## age         0.732  1 0.392
## ecog.ps     3.443  1 0.064
## age:ecog.ps 2.285  1 0.131
## GLOBAL      4.444  3 0.217

From the output above, the test is not statistically significant for each of the covariates, including the interaction term, and the global test is also not statistically significant. Therefore, we can assume the proportional hazards.

knitr::opts_chunk$set(echo = TRUE)
ggcoxzph(test.ph3)

In the figure above, the solid line is a smoothing spline fit to the plot, with the dashed lines representing a +/- 2-standard-error band around the fit. Systematic departures from a horizontal line are indicative of non-proportional hazards, since proportional hazards assumes that estimates \(\beta_1, \beta_2, \beta_3\) do not vary much over time. From the graphical inspection, there is no pattern with time. The assumption of proportional hazards appears to be supported for the covariates age and ecog and interaction term.

Question #4: Linearity assumption

Is the linear functional form for age suitable in this model? If not (or rather assume so), what functional form is appropriate for the covariate age in this model?

Solution: Question #4: Linearity assumption

Plotting the Martingale residuals against continuous covariates is used to detect nonlinearity or, in other words, to assess the functional form of a covariate. For a given continuous covariate (age in this case), patterns in the plot may suggest that the variable is not properly fit. Martingale residuals may present any value in the range \((-\infty, +1)\):

  • A value of martinguale residuals near 1 represents individuals that “died too soon”,
  • and large negative values correspond to individuals that “lived too long”.
knitr::opts_chunk$set(echo = TRUE)
ggcoxfunctional(Surv(futime, fustat == 1) ~ age + log(age) + sqrt(age), data = ovarian)
## Warning: arguments formula is deprecated; will be removed in the next version;
## please use fit instead.

Linear form for age maybe suitable for for the model. However we may take logarithm or squareroot functional form of age to eliminate non-linearility.

Question #5: Checking for influential observations

Are there any influential observations in any of the covariates that may lead to over or under - estimation of the effect of any of the covariates?

Solution: Question #5: Checking for influential observations

To test influential observations or outliers, we can visualize either:

  • the deviance residuals or
  • the dfbeta values

The function ggcoxdiagnostics()[in survminer package] provides a convenient solution for checkind influential observations. The simplified format is as follow:

ggcoxdiagnostics(fit, type = , linear.predictions = TRUE)

  • fit: an object of class coxph.object
  • type: the type of residuals to present on Y axis. Allowed values include one of c(“martingale”, “deviance”, “score”, “schoenfeld”, “dfbeta”, “dfbetas”, “scaledsch”, “partial”).
  • linear.predictions: a logical value indicating whether to show linear predictions for observations (TRUE) or just indexed of observations (FALSE) on X axis.

Specifying the argument type = “dfbeta”, plots the estimated changes in the regression coefficients upon deleting each observation in turn; likewise, type=“dfbetas” produces the estimated changes in the coefficients divided by their standard errors.

knitr::opts_chunk$set(echo = TRUE)
# model without interaction between age and ECOG
ggcoxdiagnostics(cox.model, type = "dfbeta",
                 linear.predictions = FALSE, ggtheme = theme_bw())
## `geom_smooth()` using formula = 'y ~ x'

# model with interaction term between age and ECOG
ggcoxdiagnostics(cox.model2, type = "dfbeta",
                 linear.predictions = FALSE, ggtheme = theme_bw())
## `geom_smooth()` using formula = 'y ~ x'

(First plot: Index plots of dfbeta for the Cox regression of survival time on age and ECOG performance)
(Second plot: Index plots of dfbeta for the Cox regression of survival time on age and ECOG performance and the interaction term)
The above index plots show that comparing the magnitudes of the largest dfbeta values to the regression coefficients suggests that none of the observations is terribly influential individually, even though some of the dfbeta values for age and ECOG performance and interaction term are large compared with the others.

It’s also possible to check outliers by visualizing the deviance residuals. The deviance residual is a normalized transform of the martingale residual. These residuals should be roughtly symmetrically distributed about zero with a standard deviation of 1.

  • Positive values correspond to individuals that “died too soon” compared to expected survival times.
  • Negative values correspond to individual that “lived too long”.
  • Very large or small values are outliers, which are poorly predicted by the model.
knitr::opts_chunk$set(echo = TRUE)
# model with interaction term between age and ECOG performance
ggcoxdiagnostics(cox.model2, type = "deviance",
                 linear.predictions = FALSE, ggtheme = theme_bw())
## `geom_smooth()` using formula = 'y ~ x'

# model without interaction term 
ggcoxdiagnostics(cox.model, type = "deviance",
                 linear.predictions = FALSE, ggtheme = theme_bw())
## `geom_smooth()` using formula = 'y ~ x'

For both models, the pattern looks fairly symmetric around 0.