Lung Data - Survival Analysis

Credits

“David G. Kleinbaum and Mitchel Klein, Survival Analysis. A Self-Learning Text, Third Edition.”

Summary

Introduction to Survival Analysis
- Presentation
- Censored data
  - Right-censored data
  - Left-censored data
  - Interval-censored data
- Some definitions
- Notations
- Basic Data Layout for computer
The Lung dataset
- Missing values
- Count of unique values per column
- Data desciption
- Outlier summary
- Normality distribution of each variables
- Correlations
- Time effects
Survival object
- Kaplan-Meier model
- Kaplan-Meier General model
- Example
- Compute the Kaplan-Meier model
- Using Variables: Kaplan Meier model with sex
- Comparing Survival Curves
- Transforming the survival curve 1- “log”: log transformation 2- “event”: plots cumulative events 3- “cumhaz” plots the cumulative hazard
- Conditional survival curves
Cox PH model
- Cox PH General model
- Compute the Cox Model
  - COX PH model with sex variable only
- Validity of the Cox PH model
  - Testing proportional hazard
  - Testing influential observations
    1. dfbeta values
    2. deviance residuals
  - Comparison of the KM model and COX PH model with the variable sex
  - COX PH model with age variable only
  - Smooth survival plot - quantile of survival of age
  - Transform age as a categorical variable
- Model selection with the cox PH model
  1. Model with all variables
  2. Model without meal.cal
  3. Model with ECOG score, physician-rated Karnofsky score, and weight loss
  4. Model without ph.karno
  5. Model with ph.ecog modeled as categorical
  6. Model with age modeled as categorical
- Model selection with glmulti

library(knitr)
library(kableExtra)

library(DataExplorer) # for EDA
library(dlookr) # for outlier analysis
library(summarytools)
library(corrplot)

library(survival)
library(survminer)
library(powerSurvEpi)
library(rms)  # to extract survival times from KM curve, using survest function
library(sm)  # to do quantile regression of survival data vs. continuous variable (sm.survival)

library(glmulti)  # for best subsets selection of Cox PH model


# for conditional survival plots
library(condsurv)  # remotes::install_github("zabore/condsurv")

# to extract survival probabilities, by group, from survfit (KM plot)
library(survtools)  # remotes::install_github("RichardBirnie/survtools")

# put this last so that if necessary, it masks stuff from above
library(tidyverse)

Introduction to Survival Analysis

Presentation

Generally, survival analysis is a collection of statistical procedures for data analysis for which the outcome variable of interest is time until an event occurs.

In the medical world, we typically think of survival analysis literally – tracking time until death. But, it’s more general than that – survival analysis models time until an event occurs (any event). This might be death of a biological organism. But it could also be the time until a hardware failure in a mechanical system, time until recovery, time someone remains unemployed after losing a job, time until a ripe tomato is eaten by a grazing deer, time until someone falls asleep in a workshop, etc. Survival analysis also goes by reliability theory in engineering, duration analysis in economics, and event history analysis in sociology.

Type of events: death, disease, relapse, recovery…

The goal of survival analysis is to:

Goal 1: To estimate and interpret survivor and/or hazard functions from survival data.
Goal 2: To compare survivor and/or hazard functions.
Goal 3: To assess the relationship of explanatory variables to survival time

Censored data

One aspect that makes survival analysis difficult is the concept of censoring. In essence, censoring occurs when we have some information about individual survival time, but we don’t know the survival time exactly. There are 3 main reasons why this happens:

Individual does not experience the event when the study is over.
Individual is lost to follow-up during the study period.
Individual withdraws from the study.

There are 3 major times of censoring: right, left and interval censoring.

1. Right-censored data

In right censoring, the true survival times will always be equal to or greater than the observed survival time. We have \(t_1\) = lower bound and \(t_2\) = \(\infty\). Consider the follow example where we have 3 patients (A, B, C) enrolled onto a clinical study that runs for some period of time (study end - study start).

Patient A requires no censoring since we know their exact survival time which is the time until death.
Patient B survives passed the end of the study and needs to be censored (indicated with the + at the end of the follow-up time).
Patient C withdraws from the study and needs to be censored since they withdrew before the study ended.

Note: This example is a bit unrealistic since it’s rare to have all patients enrolled at the same time in a study. In reality, patients would be enrolled at different times in the trial. But this example is meant more so to illustrate the concepts of censoring.

1. Left-censored data

In contrast to right-censoring, left censoring occurs when the person’s true survival time is less than or equal to the observed survival time. We have \(t_1\) = 0 and \(t_2\) = upper bound.

An example of a situation could be for virus testing. For instance, if we’ve been following an individual and recorded an event when for instance the individual tests positive for a virus but we don’t know the exact time of when the individual was exposed to the disease. We only know that there was some exposure between 0 and the time they were tested.

1. Interval-censored data

Survival analysis data can also be interval-censored, which can occur if a subject’s true (but unobserved) survival time is within a certain known specified time interval.

Using the virus testing example, if we have the situation whether we’ve performed testing on the indvidual at some timepoint \(t_{(1)}\) and the individual was negative. But then at a timepoint further on \(t_{(2)}\), the individual tested positive:

In this scenario, we know the individual was exposed to the virus sometime between t1 and t2, but we do not know the exact timing of the exposure.

Some definitions…

survival time: time until an event occurs. For example: time that an individual has “survived” over some follow-up period
survival probability: conditional probability of surviving beyond a time, given that an individual has survived just prior to that time.
failure: event of interest usually is death, disease incidence, or some other negative individual experience.
censoring: occurs if a subject has not experienced the event of interest by the end of the data collection. It is a type of missing data problem unique to survival analysis. We don’t know the survival time exactly because the study ends, a person is lost to follow-up or withdraws from the study…
right-censored: true survival time is equal to or greater than observed survival time
left-censored: true survival time is less than or equal to the observed survival time
interval-censored: true survival time is within a known time interval

Terminology: time = survival time; event = failure. We have two possible outcomes/status: event/failure (1) and censored (0).

Notations

T: random variable for a person’s survival time (T \(\geq\) 0)
t: specific of interest for the variable T
d: (0, 1) random variable, 1 if failure if the event occurs during the study period, 0 censored
S(t) = P(T \(>\) t): survival function, “probability that a person survives longer than some specified time t”. NB: S(0) = 1, S(\(\infty\)) = 0, S decreasing.

Theoretically, as t ranges from 0 up to infinity, the survivor function is graphed as a decreasing smooth curve, which begins at S(t) = 1 at t = 0 and heads downward toward zero as t increases toward infinity. In practice, using data, we usually obtain estimated survivor curves that are step functions, as illustrated here, rather than smooth curves.

h(t) = \(\lim\limits_{\Delta t\to 0} \frac{P(t\leq T < t + \Delta t | T \geq t )}{\Delta t}\): hazard function, "potential of failure in an infinitesimally small time period between t and t + \(\Delta t\) given that the subject has survived up till time t = P(individual fails in the interval [t, t + \(\Delta t\) | survival up to time t). In other words, the hazard function h(t) gives the instantaneous potential per unit time for the event to occur, given that the individual has survived up to time t. Note: not a density or a probability. always positive with no upper bound

Regardless of which function S(t) or h(t) one prefers, there is a clearly defined relationship between the two. In fact, if one knows the form of S(t), one can derive the corresponding h(t), and vice versa:

S(t) = \(exp[ - \int_{0}^{1} h(u) du]\) and h(t) = \(- [\frac{dS(t)/dt}{S(t)}]\)

The hazard rate indicates failure potential rather than survival probability. Thus, the higher the average hazard rate, the lower is the group’s probability of surviving.

Basic Data Layout for computer

This layout helps provide some understanding of how a survival analysis actually works and, in particular, how survivor curves are derived.

The first column in the table gives ordered survival times from smallest to largest. These are denoted by t’s with subscripts within parentheses, starting with \(t_{(0)}\), then \(t_{(1)}\) and so on, up to \(t_{(k)}\) where k is the number of distinct times at which subjects failed (\(k \leq n\)).

The second column gives frequency counts, denoted by \(m_f\), of failures at each distinct failure time.

The third column gives frequency counts, denoted by \(q_f\), of those persons censored in the time interval starting with failure time \(t_{(f)}\) up to the next failure time denoted \(t_{(f+1)}\). Technically, because of the way we have defined this interval in the table, we include those persons censored at the beginning of the interval.

The last column gives the risk set, \(R(t_{(f)})\), which denotes the collection of individuals who have survived at least to time \(t_{(f)}\).

Let’s look at an example of data derived from a study of the remission times in weeks for two groups of leukemia patients, with 21 patients in each group. Group 1 is the treatment group and group 2 is the placebo group. The basic question of interest concerns comparing the survival experience of the two groups.

For example, using the remission data for group 1, we find that 9 of the 21 persons failed, including 3 persons at 6 weeks and 1 person each at 7, 10, 13, 16, 22, and 23 weeks. These nine failures have \(k = 7\) distinct survival times, because three persons had survival time 6 and we only count one of these 6’s as distinct. The first ordered failure time for this group, denoted as \(t_{(1)}\), is 6; the second ordered failure time \(t_{(2)}\), is 7, and so on up to the seventh ordered failure time of 23.

The censored data includes 5 nonzero \(q\)’s: \(q_1 = 1\), \(q_2 = 1\), \(q_3 = 2\), \(q_5 = 3\), \(q_7 = 5\). Adding these values gives us the total number of censored observations for group 1, which is 12. Moreover, if we add the total number of \(q\)’s (12) to the total number of m’s (9), we get the total number of subjects in group 1, which is 21.

For the risk set, we see that at the start of the study everyone in group 1 survived at least 0 weeks, so the risk set at time 0 consists of the entire group of 21 persons. The risk set at 6 weeks for group 1 also consists of all 21 persons, because all 21 persons survived at least as long as 6 weeks. These 21 persons include the 3 persons who failed at 6 weeks, because they survived and were still at risk just up to this point.

Now let’s look at the risk set at 7 weeks. This set consists of 17 persons in group 1 that survived at least 7 weeks. We omit everyone in the X-ed area. Of the original 21 persons, we therefore have excluded the three persons who failed at 6 weeks and the one person who was censored at 6 weeks. These four persons did not survive at least 7 weeks. Although the censored person may have survived longer than 7 weeks, we must exclude him or her from the risk set at 7 weeks because we have information on this person only up to 6 weeks.

The Lung dataset

We use the lung dataset available from the survival package survival. The data contain subjects with advanced lung cancer from the North Central Cancer Treatment Group. It includes the 10 following variables:

inst: \(\qquad\) Institution code
time: \(\qquad\) Survival time in days
status: \(\qquad\) censoring status 1=censored, 2=dead
age: \(\qquad\) Age in years
sex: \(\qquad\) Male=1 Female=2
ph.ecog: \(\qquad\) ECOG performance score as rated by the physician. 0=asymptomatic, 1= symptomatic but completely ambulatory, 2= in bed <50% of the day, 3= in bed > 50% of the day but not bedbound, 4 = bedbound
ph.karno: \(\qquad\) Karnofsky performance score (bad=0-good=100) rated by physician
pat.karno: \(\qquad\) Karnofsky performance score (0 = bad, 100 = good) as rated by patient
meal.cal: \(\qquad\) Calories consumed at meals
wt.loss: \(\qquad\) Weight loss in last six months

data(lung)

head(lung, n=10) %>%
  kbl() %>%
  kable_styling()

inst	time	status	age	sex	ph.ecog	ph.karno	pat.karno	meal.cal	wt.loss
3	306	2	74	1	1	90	100	1175	NA
3	455	2	68	1	0	90	90	1225	15
3	1010	1	56	1	0	90	90	NA	15
5	210	2	57	1	1	90	60	1150	11
1	883	2	60	1	0	100	90	NA	0
12	1022	1	74	1	1	50	80	513	0
7	310	2	68	2	2	70	60	384	10
11	361	2	71	2	2	60	80	538	1
1	218	2	53	1	1	70	80	825	16
7	166	2	61	1	2	70	70	271	34

data(lung)
t(introduce(lung))

##                       [,1]
## rows                   228
## columns                 10
## discrete_columns         0
## continuous_columns      10
## all_missing_columns      0
## total_missing_values    67
## complete_rows          167
## total_observations    2280
## memory_usage         21240

plot_intro(lung)

We have 2280 observations. All the variables are treated as continuous. Therefore, we transform the variable sex into a factor. We can also see that only 167/228 = 73% of the rows are complete

# re-code sex and status
lung <- lung %>% 
            mutate(sex=factor(sex, levels=c(1,2), labels=c("M", "F")),
                   had_event=ifelse(status==1, 0, 1))

Missing values

plot_missing(lung, group=c("Good"=1.0), theme_config=list(text = element_text(size = 16)))

The missing values are predominantly in meal.cal and wt.loss.

Count of unique values per column

lung %>% 
    mutate(sex=as.numeric(sex)) %>%
    pivot_longer(cols=colnames(lung)) %>% 
    group_by(name) %>% 
    summarize(unique_values=n_distinct(value))

## # A tibble: 11 x 2
##    name      unique_values
##    <chr>             <int>
##  1 age                  42
##  2 had_event             2
##  3 inst                 19
##  4 meal.cal             61
##  5 pat.karno             9
##  6 ph.ecog               5
##  7 ph.karno              7
##  8 sex                   2
##  9 status                2
## 10 time                186
## 11 wt.loss              54

Data desciption

descr(lung, transpose=TRUE, stats=c("min", "med", "mean", "max", "q1", "q3", "sd"))

## Descriptive Statistics  
## lung  
## N: 228  
## 
##                      Min   Median     Mean       Max       Q1        Q3   Std.Dev
## --------------- -------- -------- -------- --------- -------- --------- ---------
##             age    39.00    63.00    62.45     82.00    56.00     69.00      9.07
##       had_event     0.00     1.00     0.72      1.00     0.00      1.00      0.45
##            inst     1.00    11.00    11.09     33.00     3.00     16.00      8.30
##        meal.cal    96.00   975.00   928.78   2600.00   635.00   1150.00    402.17
##       pat.karno    30.00    80.00    79.96    100.00    70.00     90.00     14.62
##         ph.ecog     0.00     1.00     0.95      3.00     0.00      1.00      0.72
##        ph.karno    50.00    80.00    81.94    100.00    70.00     90.00     12.33
##          status     1.00     2.00     1.72      2.00     1.00      2.00      0.45
##            time     5.00   255.50   305.23   1022.00   166.50    399.00    210.65
##         wt.loss   -24.00     7.00     9.83     68.00     0.00     16.00     13.14

tmp = lung %>% 
        select(age, meal.cal, pat.karno, ph.karno, time, wt.loss) %>% 
        pivot_longer(everything())
ggplot(tmp, aes(x=value)) + 
  geom_histogram(aes(y=..density..), alpha=0.5) + 
  geom_density() + 
  facet_wrap(. ~ name, scales="free") +
  theme(text = element_text(size=20))

# compare censored and uncensored observations
tmp = lung %>% 
        select(had_event, age, meal.cal, pat.karno, ph.karno, time, wt.loss) %>% 
        pivot_longer(-one_of("had_event")) %>%
        mutate(had_event=ifelse(had_event==0,"Censored", "Died"))
ggplot(tmp, aes(x=value)) + 
  geom_density(aes(color=had_event)) + 
  facet_wrap(. ~ name, scales="free") +
  theme(text = element_text(size=20),
        legend.title = element_blank(),
        legend.position = "top")

# bar charts of discrete features
plot_bar(lung %>% 
           select(ph.ecog, sex, had_event) %>%
           mutate(ph.ecog=factor(ph.ecog, 
                                 levels=c(0, 1, 2, 3, 4), 
                                 labels=c("0-asymptomatic", "1-symptomatic", "2-in bed <50%", "3-in bed >50%", "4-bedbound"))), 
         theme_config=list(text = element_text(size = 20)), 
         order_bar=FALSE)

# compare censored and uncensored observations
tmp = lung %>% 
        select(had_event, sex, ph.ecog) %>% 
        mutate(had_event=ifelse(had_event==0,"Censored", "Died"),
               ph.ecog=factor(ph.ecog)) %>%
        pivot_longer(-one_of("had_event")) %>%
        group_by(had_event, name, value) %>%
        summarise(count=n())
        

ggplot(tmp, aes(x=value, y=count)) + 
  geom_bar(aes(fill=had_event), position="dodge", stat="identity") + 
  facet_wrap(. ~ name, scales="free") +
  theme(text = element_text(size=20),
        legend.title = element_blank(),
        legend.position = "top")

Average time-to-event/censor: 363 days: censored subjects, 283 days: dead subjects, right-skewed
72% of individuals in the study died
61% male, 39% female
Age is left-skewed; median = 63
Weight loss in the last 6 months with Median = 7 pounds, IQR is 0 to 16 pounds.
Performance scores at study start were good: The IQR of the Karnosky scores were 70-90, between “able to live at home” and “normal activity”). Most subjects had an ECOG score between “asymptomatic” and “symptomatic but ambulatory

Outlier summary

outlier_summary = diagnose_outlier(lung) %>% filter(outliers_cnt > 0)
outlier_var_iqr = data.frame(t(apply(lung[, outlier_summary$variables], 2, quantile, c(0.25, 0.75), na.rm=TRUE)))
outlier_var_iqr %>%
  rename(Q25="X25.", Q75="X75.") %>%
  rownames_to_column("variables") %>%
  left_join(outlier_summary, by="variables") %>%
  select(variables, Q25, Q75, outliers_mean, outliers_cnt, with_mean, without_mean) %>%
  mutate(across(where(is.numeric), round, 2))

##   variables    Q25     Q75 outliers_mean outliers_cnt with_mean without_mean
## 1      time 166.75  396.50        871.70           10    305.23       279.25
## 2   ph.ecog   0.00    1.00          3.00            1      0.95         0.94
## 3  ph.karno  75.00   90.00         50.00            6     81.94        82.81
## 4 pat.karno  70.00   90.00         30.00            2     79.96        80.40
## 5  meal.cal 635.00 1150.00       2410.00            5    928.78       886.70
## 6   wt.loss   0.00   15.75         57.25            4      9.83         8.93

# use dlookr to examine outliers
plot_outlier(lung,
             diagnose_outlier(lung) %>%
                filter(outliers_cnt > 0) %>%
                select(variables) %>%
                unlist())

box_data <- lung %>% 
              select(outlier_summary$variables) %>% 
              pivot_longer(everything()) %>% 
              drop_na()
ggplot(box_data, aes(x=value)) + 
  geom_boxplot() + 
  facet_wrap(.~name, scales="free") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.x = element_blank(),
        text = element_text(size=20))

5 individuals’ calories at meals average 2,410 calories/meal, compared to an average overall of 929 calories/meal.
Some subjects lost more than 50 pounds in the last 6 months; 1 individual had gained weight.
Outliers in time-to-event survived, on average, 871 days, as compared to 305 days on average across all subjects

Normality distribution of each variables

normality(lung)

## # A tibble: 10 x 4
##    vars      statistic  p_value sample
##    <chr>         <dbl>    <dbl>  <dbl>
##  1 inst          0.919 8.23e-10    228
##  2 time          0.917 5.11e-10    228
##  3 status        0.559 1.14e-23    228
##  4 age           0.982 4.83e- 3    228
##  5 ph.ecog       0.818 1.42e-15    228
##  6 ph.karno      0.909 1.66e-10    228
##  7 pat.karno     0.920 1.13e- 9    228
##  8 meal.cal      0.905 2.14e- 9    228
##  9 wt.loss       0.915 9.65e-10    228
## 10 had_event     0.559 1.14e-23    228

plot_normality(lung)

We examine normality, but our variables are all skewed.

Correlations

# dlookr: correlation heatmap
data_for_cor <- lung %>% 
                  select(age, ph.ecog, ph.karno, pat.karno, meal.cal, wt.loss, time) %>% 
                  replace(is.na(.), 0)

corrs = cor(data_for_cor)
corrplot(corrs, type="upper", method="color", addCoef.col = "black")

We can observe a strong correlation between ph.ecog and ph.karno, and the correlation between ph.ecog and pat.karno.
The age is associated with worse performance scores, lower meal calories, more weight loss, and shorter time-to-event.
Good performance scores are associated with more calories at meals, less weight loss, and longer times-to-event

Time effects

scattervars <- c("inst", "time", "age", "ph.ecog", "ph.karno", "pat.karno", "meal.cal", "wt.loss")
time_df <- lung %>% 
                select(all_of(scattervars)) %>%
                pivot_longer(cols=-time, names_to="variable")

ggplot(time_df, aes(x=time, y=value)) +
    geom_jitter() +
    facet_wrap(vars(variable), nrow=2, scales="free")

Survival object

‘Surv’ is the main function used to create the survival object, usually used as a response variable in a model formula. It takes as an input the time and event.

Surv(lung$time, lung$status)[1:10]

##  [1]  306   455  1010+  210   883  1022+  310   361   218   166

Kaplan-Meier model

Kaplan-Meier General model

The Kaplan-Meier method is the most common way to estimate survival times and probabilities. It is a non-parametric approach that results in a step function, where there is a step down each time an event occurs.

The remaining survival probabilities are computed such that we count the number of subjects surviving past the specified time being considered and divide by the number of subjects at the start of follow-up.

\[\hat{S}(t_{f}) = \hat{S}(t_{f-1}) \hat{P}r(T \geq t_{(f)} | T \geq t_{(f)} ) = \prod_{i=1}^{f} \hat{P}r(T \geq t_{(i)} | T \geq t_{(i)} )\]

This formula gives the probability of surviving past the previous failure time \(t_{f-1}\), multiplied by the conditional probability of surviving past time \(t_{f}\), given survival to at least time \(t_{f}\). Each term in the product is the probability of exceeding a specific ordered failure time \(t_{f}\) given that a subject survives up to that failure time.

Example

Looking at the previous example, we can calculate the survival time from the Kaplan-Meier model.

The first survival estimate on the list is \(\hat{S}(0) = 1\), as it will always be, because this gives the probability of surviving past time zero.

The other survival estimates are calculated by multiplying the estimate for the immediately preceding failure time by a fraction. For example, the fraction is 18/21 for surviving past week 6, because 21 subjects remain up to week 6 and 3 of these subjects fail to survive past week 6. The fraction is 16/17 for surviving past week 7, because 17 people remain up to week 7 and 1 of these fails to survive past week 7. The other fractions are calculated similarly.

Compute the Kaplan-Meier model

With the survfit() function, we create a simple survival curve that doesn’t consider any different groupings, so we’ll specify just an intercept (e.g., ~1) in the formula that survfit expects.

fit_overall = survfit(Surv(time, event=had_event) ~ 1, data=lung)
print(fit_overall)

## Call: survfit(formula = Surv(time, event = had_event) ~ 1, data = lung)
## 
##       n  events  median 0.95LCL 0.95UCL 
##     228     165     310     285     363

By default, the function print() shows a short summary of the survival curves. It prints the number of observations, number of events, the median survival and the confidence limits for the median. We can note that the median survival time is 310 days.

The function survfit() returns a list of variables, including the following components:

n: total number of subjects in each curve.
time: the time points on the curve.
n.risk: the number of subjects at risk at time t
n.event: the number of events that occurred at time t.
n.censor: the number of censored subjects, who exit the risk set, without an event, at time t.
lower,upper: lower and upper confidence limits for the curve, respectively.

# creates the survival table 
f <- summary(fit_overall)
df_overall_fit <- data.frame(f$time, f$n.risk, f$n.event, f$n.censor, f$surv, f$lower, f$upper)
names(df_overall_fit) <- c("time", "n.risk", "n.event", "n.censor", "survival", "ci_95_lower", "ci_95_upper")
head(df_overall_fit, n=10)

##    time n.risk n.event n.censor  survival ci_95_lower ci_95_upper
## 1     5    228       1        0 0.9956140   0.9870734   1.0000000
## 2    11    227       3        0 0.9824561   0.9655619   0.9996460
## 3    12    224       1        0 0.9780702   0.9592437   0.9972662
## 4    13    223       2        0 0.9692982   0.9471630   0.9919508
## 5    15    221       1        0 0.9649123   0.9413217   0.9890941
## 6    26    220       1        0 0.9605263   0.9355811   0.9861367
## 7    30    219       1        0 0.9561404   0.9299253   0.9830945
## 8    31    218       1        0 0.9517544   0.9243423   0.9799794
## 9    53    217       2        0 0.9429825   0.9133598   0.9735659
## 10   54    215       1        0 0.9385965   0.9079467   0.9702809

We use the function ggsurvplot() [in Survminer R package] to produce the survival curves for the two groups of subjects.

ggsurvplot(fit_overall, 
            xlab="Days", 
            ylab="Overall survival probability",
            risk.table=TRUE,
            conf.int=TRUE,
            surv.median.line="hv")  # draw horizontal AND vertical line for median

Using Variables: Kaplan Meier model with sex

we model something besides just an intercept. Let’s fit survival curves separately by sex.

# visualize (KM plot) effect of sex

# median survival time for men = 270 days, vs. 426 days for women
km_sex = survfit(Surv(time, event=had_event) ~ strata(sex), data=lung)
print(km_sex)

## Call: survfit(formula = Surv(time, event = had_event) ~ strata(sex), 
##     data = lung)
## 
##                 n events median 0.95LCL 0.95UCL
## strata(sex)=M 138    112    270     212     310
## strata(sex)=F  90     53    426     348     550

ggsurvplot(survfit(Surv(time, event=had_event) ~ sex, data=lung), # survival model
           xlab="Days", 
           ylab="Overall survival probability",
           pval = TRUE, # displays p-value of log-rank test of the difference between the two curves
           risk.table=TRUE,
           conf.int=TRUE,  # plots a confidence interval for each curve
           surv.median.line="hv")

## Warning: Vectorized input to `element_text()` is not officially supported.
## Results may be unexpected or may change in future versions of ggplot2.

ggsurvplot(
  survfit(Surv(time, event=had_event) ~ sex, data=lung),                     #survival model we want to plot 
  pval = TRUE,              #displays p-value of log-rank test, if p-value < 0.05, then the difference between the two curves are statistically significant
  conf.int = TRUE,          #plots a confidence interval for each curve
  xlab = "Time in days",
  break.time.by = 150,      # break X axis in time intervals by 100.
  ggtheme = theme_light(),  # customize theme with a grid for better readability 
  risk.table = "abs_pct",   # absolute number and percentage at risk
  risk.table.y.text.col = T,# colour risk table text annotations
  risk.table.y.text = FALSE,# show bars instead of names in text annotations
                            # in legend of risk table.
  ncensor.plot = TRUE,      # plot the number of censored subjects at time t
  surv.median.line = "hv",   # add the median survival pointer
  legend.labs=c("Male", "Female"), legend.title="Sex",  
  palette=c("dodgerblue2", "orchid2"), 
  title="Kaplan-Meier Curve for Lung Cancer Survival by sex", 
  risk.table.height=.15
)

## Warning: Vectorized input to `element_text()` is not officially supported.
## Results may be unexpected or may change in future versions of ggplot2.

Comparing Survival Curves

The log-rank test can be used to evaluate whether or not KM curves for two or more groups are statistically equivalent. The null hypothesis is that there is no difference in survival between the groups.

The log rank test is a non-parametric test, which makes no assumptions about the survival distributions. It is approximately distributed as a chi-square test statistic.

We consider the previous example.

We have: * \(m_{if}\) = observed counts in group i at time f * \(e_^{if}\) = expected counts in group i at time f

The null hypothesis being tested is that there is no overall difference between the two survival curves. Under this null hypothesis, the log-rank statistic is approximately chi-square with one degree of freedom. Thus, a p-value for the log–rank test is determined from tables of the chi-square distribution.

For two-group case, the log–rank test statistic is formed using the sum of the observed minus expected counts over all failure times for one of the groups:

\[X^{2} = \frac{(O_{i} - E_{i})^{2}}{V(O_{i} - E_{i})} \sim \chi^{2}\]

where: \(\#\) of failure times for the group i is \(O_{i} - E_{i} = \sum_{f=1}^{27} (m_{if} - e_{if})\) and \(e_1 = (\frac{n_{1f}}{n_{1f} + n_{2f}}) x (m_{1f} + m_{2f})\), \(e_2 = (\frac{n_{2f}}{n_{1f} + n_{2f}}) x (m_{1f} + m_{2f})\).

Therefore, we have \(O_{1} - E_{1} = -10.26\) and \(O_{2} - E_{2} = 10.26\).

For two groups, the variance formula is the same for each group:

\[V(O_{i} - E_{i}) = \sum_j \frac{n_{1f}n_{2f}(m_{1f} + m_{2f})(n_{1f} + n_{2f} - m_{1f} - m_{2f})}{(n_{1f} + n_{2f})(n_{1f} + n_{2f} - 1)}\] We will use the group 2 value to carry out the test, but as we can see, except for the minus sign, the difference is the same for the two groups.

Therefore, the estimated variance of O2 - E2 is computed from the variance formula above to be 6.2685. The log–rank statistic then is obtained by squaring 10.26 and dividing by 6.285, which yields 16.793, as shown on the computer printout.

\[X^{2} = \frac{(O_{i} - E_{i})^{2}}{\hat{V}(O_{i} - E_{i})} = \frac{(10.26)^2}{ 6.2685} = 16.793\].

An approximation to the log–rank statistic, shown here, can be calculated using observed and expected values for each group without having to compute the variance formula.

\[X^{2} = \sum_{i}^{\# of groups} \frac{(O_{i} - E_{i})^{2}}{E_{i}}\]

In our case, we have:

\[X^{2} = \frac{(-10.26)^2}{ 19.26} + \frac{(10.26)^2}{ 10.74} = 15.276\]

The calculation of the approximate formula is shown here for the remission data. The expected values are 19.26 and 10.74 for groups 1 and 2, respectively. The chi-square value obtained is 15.276, which is slightly smaller than the log–rank statistic of 16.793.

Although the same tabular layout can be used to carry out the calculations when there are more than two groups, the test statistic is more complicated mathematically, involving both ariances and covariances of summed observed minus expected scores for each group.

The function survdiff() [in survival package] can be used to compute log-rank test comparing males and females survival curves.

survdiff(Surv(time, had_event) ~ sex, lung)

## Call:
## survdiff(formula = Surv(time, had_event) ~ sex, data = lung)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=M 138      112     91.6      4.55      10.3
## sex=F  90       53     73.4      5.68      10.3
## 
##  Chisq= 10.3  on 1 degrees of freedom, p= 0.001

The log rank test for difference in survival gives a p-value of p = 0.001, indicating that the sex groups differ significantly in survival.

There are several alternatives to the log rank test designed to test the hypothesis that two or more survival curves are equivalent called the Wilcoxon, the Tarone-Ware, the Peto, and the Flemington-Harrington test. These test statistics are variations of the log rank test statistic and are derived by applying different weights at the f-th failure time (as shown on the left for two groups).

Transforming the survival curve

There is another major argument of the ggsurvplot() function, “fun”, which proposes alternative transformations:

1- “log”: log transformation of the survivor function

ggsurvplot(survfit(Surv(time, event=had_event) ~ sex, data=lung),
          conf.int = TRUE,
          ggtheme = theme_bw(), 
          fun = "log")

2- “event”: plots cumulative events (f(y) = 1-y). It’s also known as the cumulative incidence.

ggsurvplot(survfit(Surv(time, event=had_event) ~ sex, data=lung),
          conf.int = TRUE,
          ggtheme = theme_bw(), 
          fun = "event")

3- “cumhaz” plots the cumulative hazard function (f(y) = -log(y))

The cumulative hazard (H(t)) can be interpreted as the cumulative force of mortality. In other words, it corresponds to the number of events that would be expected for each individual by time t if the event were a repeatable process.

ggsurvplot(survfit(Surv(time, event=had_event) ~ sex, data=lung),
          conf.int = TRUE,
          ggtheme = theme_bw(), 
          fun = "cumhaz")

Conditional survival curves

It can be useful to generate survival estimates among a group of patients who have already survived for some length of time.

\[S(y|x) = \frac{S(x + y)}{S(x)}\] where:

y: number of additional survival years of interest
x: number of years a patient has already survived

We can compute the conditional survival estimates with the package condsurv and generate estimates and plots related to conditional survival.

We can use the conditional_surv_est function to get estimates and 95% confidence intervals. Let’s condition on survival to 6-months.

fit_overall = survfit(Surv(time, event=had_event) ~ 1, data=lung)
surv_times <- c(0, 0.5*365, 365, 1.5*365, 2*365)

gg_conditional_surv(
  basekm = fit_overall,  # intercept-only model 
  at = surv_times,
  main = "Conditional survival in lung data",
  xlab = "Days"
  ) +
  labs(color = "Conditional time")

The resulting plot has one survival curve for each time on which we condition. In this case the first line is the overall survival curve since it is conditioning on time 0.

conditional_surv_est(fit_overall, t1=0, t2=2*365)

## $cs_est
## [1] 0.12
## 
## $cs_lci
## [1] 0.07
## 
## $cs_uci
## [1] 0.18

surv_times <- c(0.25*365, 0.5*365, 365, 1.5*365, 2*365)

# unconditional survival probabilities (at 3/6/12/18/24 months)
purrr::map_df(
  surv_times, 
  ~conditional_surv_est(
    basekm = fit_overall, 
    t1 = 0, 
    t2 = .x) 
  ) %>% 
  mutate(months = round(surv_times / 30.4)) %>% 
  select(months, everything())

## # A tibble: 5 x 4
##   months cs_est cs_lci cs_uci
##    <dbl>  <dbl>  <dbl>  <dbl>
## 1      3   0.88   0.83   0.92
## 2      6   0.71   0.64   0.76
## 3     12   0.41   0.34   0.48
## 4     18   0.26   0.19   0.32
## 5     24   0.12   0.07   0.18

# 3-month conditional survival probabilities (at 6/12/18/24 months)
purrr::map_df(
  surv_times[-1], 
  ~conditional_surv_est(
    basekm = fit_overall, 
    t1 = 0.25*365, 
    t2 = .x) 
  ) %>% 
  mutate(months = round(surv_times[-1] / 30.4)) %>% 
  select(months, everything())

## # A tibble: 4 x 4
##   months cs_est cs_lci cs_uci
##    <dbl>  <dbl>  <dbl>  <dbl>
## 1      6  0.8     0.74   0.85
## 2     12  0.46    0.39   0.54
## 3     18  0.290   0.22   0.37
## 4     24  0.13    0.08   0.2

# 6-month conditional survival probabilities (at 12/18/24 months)
purrr::map_df(
  surv_times[-1:-2], 
  ~conditional_surv_est(
    basekm = fit_overall, 
    t1 = 0.5*365, 
    t2 = .x) 
  ) %>% 
  mutate(months = round(surv_times[-1:-2] / 30.4)) %>% 
  select(months, everything())

## # A tibble: 3 x 4
##   months cs_est cs_lci cs_uci
##    <dbl>  <dbl>  <dbl>  <dbl>
## 1     12  0.580   0.49   0.66
## 2     18  0.36    0.27   0.45
## 3     24  0.16    0.1    0.25

# 12-month survival probabilities (at 18/24 months)
purrr::map_df(
  surv_times[-1:-3], 
  ~conditional_surv_est(
    basekm = fit_overall, 
    t1 = 365, 
    t2 = .x) 
  ) %>% 
  mutate(months = round(surv_times[-1:-3] / 30.4)) %>% 
  select(months, everything())

## # A tibble: 2 x 4
##   months cs_est cs_lci cs_uci
##    <dbl>  <dbl>  <dbl>  <dbl>
## 1     18   0.62   0.49   0.73
## 2     24   0.28   0.17   0.41

Cox PH model

Kaplan-Meier curves are good for visualizing differences in survival between two categorical groups, and the log-rank test you get when you ask for pval = TRUE is useful for asking if there are differences in survival between different groups. But this doesn’t generalize well for assessing the effect of quantitative variables.

Another method, called Cox PH regression, can assess the effect of both categorical and continuous variables, and can model the effect of multiple variables at once.

Cox PH General model

The Cox model is expressed by the hazard function denoted by \(h(t)\). It is a semi-parametric model that can be used to fit univariable and multivariable regression models that have survival outcomes. The hazard function can be interpreted as the risk of dying at time t. It can be estimated as follow:

\(h(t, X_{i}) = h_{0}(t)e^{ \sum_{j=1}^{p} \beta_{j}X_{i,j}} = h_{0}(t)exp(\beta_{1}X_{i,1} + ... +\ beta_{p}X_{i,p})\)

where:

\(h(t)\) is the hazard, the instantaneous rate at which events occur.
\(h_{0}(t)\) is called the baseline hazards (when all X’s are equal to 0), depends on \(t\)
\(X = (X_{1}, X_{2},..., X_{p})\) explanatory/predictor variables
\(e^{ \sum_{i=1}^{p} \beta_{i}X_{i}}\), depends only on X’s, called .

Because the baseline hazard \(h_{0}(t)\) is an unspecified function, the Cox model us a semiparametric model.

Advantages of the model: “robust” model, so that the results from using the Cox model will closely approximate the results for the correct parametric model.

The Cox model can be written as a multiple linear regression of the logarithm of the hazard on the variables \(X_{i}\), with the baseline hazard, \(h_{0}(t)\), being an ‘intercept’ term that varies with time.

\[log(h(t, X_{i})) = log(h_{0}(t)) + \sum_{j=1}^{p} \beta_{j}X_{i,j}\]

We can compute the hazard ratio, which is the ratio of hazards between two groups at any particular point in time: “hazard for one individual divided by the hazard for a different individual”.

\[\hat{HR} = \frac{\hat{h}(t, X^{*})}{\hat{h}(t, X)} = e^{ \sum_{i=1}^{p} \beta_{i} (X^{*}_{i} - X_{i})}\]

with:

X\(^{*}\): set of predictors for one individual

X: set of predictors for the other individual

This model shows that the hazard ratio is equal to \(e^{ \sum_{i=1}^{p} \beta_{i} (X^{*}_{i} - X_{i})}\), and remains constant over time t (hence the name proportional hazards regression). In this sense, we do not need the baseline hazard because we can interpret coefficients as hazard ratios.

A hazard ratio above 1 indicates a covariate that is positively associated with the event probability, and thus negatively associated with the length of survival.

In summary,

HR=1 : No effect
HR<1: Reduction in the hazard
HR>1: Increase in Hazard

As a note, in cancer studies, a covariate with hazard ratio :

greater than 1 (i.e.: b>0) is called bad prognostic factor.
smaller than 1 (i.e.: b<0) is called good prognostic factor.

As a consequence, a major assumption of this model is that the HR is constant over time because it is independent of time. Or equivalently that the hazard for one individual is proportional to the hazard for any other individual, where the proportionality constant is independent of time.

It is possible, nevertheless, to consider X’s which do involve t. Such X’s are called time-dependent variables. If time-dependent variables are considered, the Cox model form may still be used, but such a model no longer satisfies the PH assumption, and is called the extended Cox model.

Compute the Cox Model

The coxph() function uses the same syntax as lm(), glm(), etc. The response variable you create with Surv() goes on the left hand side of the formula, specified with a ~. Explanatory variables go on the right side.

COX PH model with sex variable only

cox_sex = coxph(Surv(time, had_event) ~ sex, data=lung)
print(cox_sex)

## Call:
## coxph(formula = Surv(time, had_event) ~ sex, data = lung)
## 
##         coef exp(coef) se(coef)      z       p
## sexF -0.5310    0.5880   0.1672 -3.176 0.00149
## 
## Likelihood ratio test=10.63  on 1 df, p=0.001111
## n= 228, number of events= 165

The effect of sex is significantly related to survival (p-value = 0.00149), with better survival in females in comparison to males (hazard ratio of dying = 0.588).

The model is statistically significant. That 0.00111 p-value of the model is really close to the p = 0.00131 p-value we saw on the Kaplan-Meier nodel as well as the likelihood ratio test = 10.63 is close to the log-rank chi-square (10.3) in the Kaplan-Meier model.

\(e^{\beta_{1}}\) = \(e^{-0.531}\) = 0.5880 is the hazard ratio - the multiplicative effect of that variable on the hazard rate (for each unit increase in that variable). Females have 0.588 (~ 60%) times the hazard of dying in comparison to males. So, for a categorical variable like sex, going from male (baseline) to female results in approximately ~40% reduction in hazard.

We could also flip the sign on the coef column, and take \(e^{0.531}\) = 1/0.588 = 1.7, which we can interpret as being male resulting in a 1.7-fold increase in hazard, or that males die as approximately 1.7 x the rate per unit time as females (females die at 0.588x the rate per unit time as males).

We can directly calculate the log-rank test p-value using survdiff(). Cox regression and the logrank test from survdiff will to give you similar results most of the time. The log-rank test is asking if survival curves differ significantly between two groups. Cox regression is asking which of many categorical or continuous variables significantly affect survival.

Recall:

The log-rank test tests the following hypothesis:

H0: There is no difference in the survival function when comparing males to females after adjusting for age.

Ha: There is a difference in the survival function when comparing males to females after adjusting for age.

survdiff(Surv(time, status)~sex, data=lung)

## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = lung)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=M 138      112     91.6      4.55      10.3
## sex=F  90       53     73.4      5.68      10.3
## 
##  Chisq= 10.3  on 1 degrees of freedom, p= 0.001

Validity of the Cox PH model

The Cox proportional hazards model makes several assumptions. We use residuals methods to:

check the proportional hazards assumption with the Schoenfeld residuals
detect nonlinearity in relationship between the log hazard and the covariates using Martingale residual
examining influential observations (or outliers) with deviance residual (symmetric transformation of the martinguale residuals), to examine influential observations

Testing proportional hazard

The proportional hazard assumption is supported by a non-significant relationship between residuals and time, and refuted by a significant relationship.

We can test with the Goodness of Fit (GOF) approach based on the residuals defined by Schoenfeld.

The idea behind the statistical test is that if the PH assumption holds for a particular covariate then the Schoenfeld residuals for that covariate will not be related to survival time.

The implementation of the test can be thought of as a three-step process.

Step 1. Run a Cox PH model and obtain Schoenfeld residuals for each predictor.
Step 2. Create a variable that ranks the order of failures. The subject who has the first (earliest) event gets a value of 1, the next gets a value of 2, and so on.
Step 3. Test the correlation between the variables created in the first and second steps. The null hypothesis is that the correlation between the Schoenfeld residuals and ranked failure time is zero.

For each covariate, the function cox.zph() correlates the corresponding set of scaled Schoenfeld residuals with time, to test for independence between residuals and time. Additionally, it performs a global test for the model as a whole.

test.ph <- cox.zph(cox_sex)
print(test.ph)

##        chisq df     p
## sex     2.86  1 0.091
## GLOBAL  2.86  1 0.091

ggcoxzph(test.ph)

From the output above, the test is not statistically significant, and therefore the global test is also not statistically significant. Therefore, we can assume the proportional hazards.

In the graphical diagnostic using the function ggcoxzph() [in the survminer package], the solid line is a smoothing spline fit to the plot, with the dashed lines representing a +/- 2-standard-error band around the fit. From the graphical inspection, there is no pattern with time. The assumption of proportional hazards appears to be supported for the covariates sex (which is, recall, a two-level factor, accounting for the two bands in the graph).

Another approach is to graphically check the PH assumption by comparing -log–log survival curves. A log–log survival curve is simply a transformation of an estimated survival curve that results from taking the natural log of an estimated survival probability twice.

\(h(t, X) = h_{0}(t)e^{ \sum_{j=1}^{p} \beta_{j}X_{j}}\) which is equivalent to \(S(t,X) = [S_{0}(t)]^{e^{ \sum_{j=1}^{p} \beta_{j}X_{j}}}\)

Therefore, \[-ln(-ln(S(t, X)))\]

\[= -ln(-ln([S_{0}(t)]^{e^{ \sum_{j=1}^{p} \beta_{j}X_{j}}}))\] \[= -ln[-ln(S_{0}(t))] - ln[e^{ \sum_{j=1}^{p} \beta_{j}X_{j}}]\]

\[= - \sum_{j=1}^{p} \beta_{j} X_{j} - ln[-ln(S_{0}(t))]\]

Therefore, the corresponding log–log curves for these individuals are given as shown here, where we have simply substituted \(X_1\) and \(X_2\) for \(X\) in the previous expression for the log–log curve for any individual \(X\).

\[ln(-ln(S(t, X_1))) - ln(-ln(S(t, X_2)))\] \[= \sum_{j=1}^{p} \beta_{j} (X_{1j} - X_{2j})\]

The baseline survival function has dropped out, so that the difference in log–log curves involves an expression that does not involve time t. The above formula says that if we use a Cox PH model and we plot the estimated -log–log survival curves for two groups of individuals on the same graph, the two plots would be approximately parallel.

The distance between the two curves is the linear expression involving the differences in predictor values, which does not involve time. Note, in general, if the vertical distance between two curves is constant, then the curves are parallel.

m = survfit(Surv(time, had_event) ~ sex, lung)
s = summary(m)
s_table = data.frame(s$strata, s$time, s$n.risk, s$n.event, s$n.censor, s$surv, s$lower, s$upper)
s_table = s_table %>%
              rename(strata=s.strata, time=s.time, surv=s.surv, lower=s.lower, upper=s.upper) %>%
              mutate(negloglogsurv=-log(-log(surv)))

ggplot(s_table, aes(x=time, y=negloglogsurv, color=strata)) + 
  geom_line(size=1.25) +
  theme(text=element_text(size=16),
        plot.title=element_text(hjust=0.5)) +
  ylab("-ln(-ln(S(t)))") +
  ggtitle("-log-log plot of survival time by sex")

The two curve do not cross, therefore this result suggests that the two groups in sex satisfy the PH assumption.

Testing influential observations

To test influential observations or outliers, we can visualize either: the dfbeta values or the deviance residuals.

1. dfbeta values

This plot produces the estimated changes in the coefficients divided by their standard errors. The comparison of the magnitudes of the largest dfbeta values to the regression coefficients suggests that none of the observations is terribly influential individually.

ggcoxdiagnostics(cox_sex, type = "dfbeta",
                 linear.predictions = FALSE, ggtheme = theme_bw())

## `geom_smooth()` using formula 'y ~ x'

1. deviance residuals

The deviance residual is a normalized transformation of the martingale residual. These residuals should be roughly symmetrically distributed about zero with a standard deviation of 1.

Positive values correspond to individuals that “died too soon” compared to expected survival times.
Negative values correspond to individual that “lived too long”.
Very large or small values are outliers, which are poorly predicted by the model.

ggcoxdiagnostics(cox_sex, type = "deviance",
                 linear.predictions = FALSE, ggtheme = theme_bw())

## `geom_smooth()` using formula 'y ~ x'

The pattern looks fairly symmetric around 0 for the variable sex.

Comparison of the KM model and COX PH model with the variable sex

We can then compare the survival curves from KM (km_sex) and Cox PH model (cox_sex).

km_sex = survfit(Surv(time, event=had_event) ~ strata(sex), data=lung)
km.curve <- getKMcurve(km = km_sex, 
                       time.col = 'time', 
                       event.col = 'had_event', 
                       group.col = 'sex', 
                       data = lung)
km.curve = km.curve %>% 
            select(group, Time, Survival) %>% 
            rename(sex=group, time=Time, surv=Survival) %>%
            mutate(sex=factor(sex, levels=c("M", "F")))
            
cox.curve = data.frame(sex=factor(lung$sex, levels=c("M", "F")),
                       time=lung$time,
                       surv=exp(-predict(cox_sex, type="expected")))
ggplot() +
  geom_line(data=km.curve, aes(x=time, y=surv, color=sex), size=0.25) +
  geom_line(data=cox.curve, aes(x=time, y=surv, color=sex), size=1, linetype="dashed") +
  ggtitle("Kaplan-Meier vs. Cox Model") +
  ylab("Survival Probability") + 
  theme(title=element_text(hjust=.5),
        text=element_text(size=20))

COX PH model with age variable only

cox_age = coxph(Surv(time, had_event) ~ age, data=lung)
print(cox_age)

## Call:
## coxph(formula = Surv(time, had_event) ~ age, data = lung)
## 
##         coef exp(coef) se(coef)     z      p
## age 0.018720  1.018897 0.009199 2.035 0.0419
## 
## Likelihood ratio test=4.24  on 1 df, p=0.03946
## n= 228, number of events= 165

The effect of age is significantly related to survival (p-value = 0.03946). \(e^{\beta_{1}}\) = \(e^{0.018720}\) = 1.018897 is the hazard ratio - the multiplicative effect of that variable on the hazard rate (for each unit increase of age). We have a better survival in younger people (hazard ratio of dying = 1.018897).

test.ph <- cox.zph(cox_age)
ggcoxzph(test.ph)

ggcoxdiagnostics(cox_age, type = "dfbeta",
                 linear.predictions = FALSE, ggtheme = theme_bw())

## `geom_smooth()` using formula 'y ~ x'

ggcoxdiagnostics(cox_age, type = "deviance",
                 linear.predictions = FALSE, ggtheme = theme_bw())

## `geom_smooth()` using formula 'y ~ x'

ggcoxdiagnostics(cox_age, type = "martingale",
                 linear.predictions = FALSE, ggtheme = theme_bw())

## `geom_smooth()` using formula 'y ~ x'

We do not observe any pattern in time, none of the observations is terribly influential individually.

The last plot test the linearity by plotting the martingale residuals against continuous covariates. The age is slightly non-linear.

Smooth survival plot - quantile of survival of age

We want to visualize the survival estimate according to a continuous variable, age. The sm.survival function from the sm package allows to do this for a quantile of the distribution of survival data. The default quantile is p = 0.5 for median survival.

The x’s represent events, the o’s represent censoring events and the line is a smoothed estimate of a qunatile survival according to age.

sm.options(
  list(
    xlab = "Age (years)",
    ylab = "Time to death (years)")
  )

layout(matrix(c(1,2,3), ncol=3))

#par(mar=c(5.1, 4.1, 4.1, 2.1))
par(pty="s")  # make plot square

sm.survival(
  x = lung$age,
  y = lung$time,
  status = lung$status,
  h = 1 * sd(lung$age) / nrow(lung)^(-1/4),
  p = .5,
  add=FALSE
)

## Warning in plot.window(...): "add" is not a graphical parameter

## Warning in plot.xy(xy, type, ...): "add" is not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "add" is not a
## graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "add" is not a
## graphical parameter

## Warning in box(...): "add" is not a graphical parameter

## Warning in title(...): "add" is not a graphical parameter

sm.survival(
  x = lung$age,
  y = lung$time,
  status = lung$status,
  h = 1 * sd(lung$age) / nrow(lung)^(-1/4),
  p = .25,
  add=TRUE
)

sm.survival(
  x = lung$age,
  y = lung$time,
  status = lung$status,
  h = 1 * sd(lung$age) / nrow(lung)^(-1/4),
  p = .75,
  add=TRUE
)

title('1*h, quantiles .25, .5, .75')



par(pty="s")  # make plot square
sm.survival(
  x = lung$age,
  y = lung$time,
  status = lung$status,
  h = .5 * sd(lung$age) / nrow(lung)^(-1/4),
  p = .5,
  add=FALSE
)

## Warning in plot.window(...): "add" is not a graphical parameter

## Warning in plot.xy(xy, type, ...): "add" is not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "add" is not a
## graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "add" is not a
## graphical parameter

## Warning in box(...): "add" is not a graphical parameter

## Warning in title(...): "add" is not a graphical parameter

sm.survival(
  x = lung$age,
  y = lung$time,
  status = lung$status,
  h = .5 * sd(lung$age) / nrow(lung)^(-1/4),
  p = .25,
  add=TRUE
)

sm.survival(
  x = lung$age,
  y = lung$time,
  status = lung$status,
  h = .5 * sd(lung$age) / nrow(lung)^(-1/4),
  p = .75,
  add=TRUE
)

title('.5*h, quantiles .25, .5, .75')



par(pty="s")  # make plot square
sm.survival(
  x = lung$age,
  y = lung$time,
  status = lung$status,
  h = .25 * sd(lung$age) / nrow(lung)^(-1/4),
  p = .5,
  add=FALSE
)

## Warning in plot.window(...): "add" is not a graphical parameter

## Warning in plot.xy(xy, type, ...): "add" is not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "add" is not a
## graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "add" is not a
## graphical parameter

## Warning in box(...): "add" is not a graphical parameter

## Warning in title(...): "add" is not a graphical parameter

sm.survival(
  x = lung$age,
  y = lung$time,
  status = lung$status,
  h = .25 * sd(lung$age) / nrow(lung)^(-1/4),
  p = .25,
  add=TRUE
)

sm.survival(
  x = lung$age,
  y = lung$time,
  status = lung$status,
  h = .25 * sd(lung$age) / nrow(lung)^(-1/4),
  p = .75,
  add=TRUE
)

title('.25*h, quantiles .25, .5, .75')

sm.options(
  list(
    xlab = "Age (years)",
    ylab = "Time to death (days)")
  )

#par(mar=c(5.1, 4.1, 4.1, 2.1))
par(pty="s")  # make plot square

par(pty="s")  # make plot square
sm.survival(
  x = lung$age,
  y = lung$time,
  status = lung$status,
  h = .5 * sd(lung$age) / nrow(lung)^(-1/4),
  p = .5,
  add=FALSE
)

## Warning in plot.window(...): "add" is not a graphical parameter

## Warning in plot.xy(xy, type, ...): "add" is not a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "add" is not a
## graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "add" is not a
## graphical parameter

## Warning in box(...): "add" is not a graphical parameter

## Warning in title(...): "add" is not a graphical parameter

sm.survival(
  x = lung$age,
  y = lung$time,
  status = lung$status,
  h = .5 * sd(lung$age) / nrow(lung)^(-1/4),
  p = .25,
  add=TRUE
)

sm.survival(
  x = lung$age,
  y = lung$time,
  status = lung$status,
  h = .5 * sd(lung$age) / nrow(lung)^(-1/4),
  p = .75,
  add=TRUE
)

title('Median survival by age, with IQR')

In this case, the line is too smooth so we add the option h is the smoothing parameter. This should be related to the standard deviation of the continuous covariate, x. Suggested to start with \(\frac{sd(x)}{n^{−1/4}}\) then reduce by 1/2, 1/4, etc to get a good amount of smoothing. The previous plot was too smooth so let’s reduce it by 1/4.

sm.survival(
  x = lung$age,
  y = lung$time,
  status = lung$status,
  h = .25 * sd(lung$age) / nrow(lung)^(-1/4)
  
)

Transform age as a categorical variable

We split the age variable into 4 categories: younger than 50, 51-60, 61-70, older than 70.

lung_cat = lung %>% 
  mutate(age_grp = cut(lung$age, breaks=c(-Inf, 50, 60, 70, Inf), labels=c("<=50", "51-60", "61-70", ">70")))

cox_age_cat = coxph(Surv(time, had_event) ~ age_grp, data=lung_cat)
print(cox_age_cat)

## Call:
## coxph(formula = Surv(time, had_event) ~ age_grp, data = lung_cat)
## 
##                coef exp(coef) se(coef)     z      p
## age_grp51-60 0.1741    1.1901   0.2895 0.601 0.5477
## age_grp61-70 0.2268    1.2546   0.2805 0.809 0.4187
## age_grp>70   0.5731    1.7737   0.2992 1.916 0.0554
## 
## Likelihood ratio test=4.94  on 3 df, p=0.176
## n= 228, number of events= 165

print(paste0("cox_age AIC = ", round(aic(cox_age), 1)))

## [1] "cox_age AIC = 1497.6"

print(paste0("cox_age_cat AIC = ", round(aic(cox_age_cat), 1)))

## [1] "cox_age_cat AIC = 1500.9"

cox.zph(cox_age_cat)

##         chisq df    p
## age_grp  1.96  3 0.58
## GLOBAL   1.96  3 0.58

The model is not signicant and neither of the variables are. We split the age in only two groups: younger than 70 and older than 70.

lung_cat = lung %>% 
  mutate(age_grp = cut(lung$age, breaks=c(-Inf, 69, Inf), labels=c("39-69", "70+")))

cox_age_70 = coxph(Surv(time, had_event) ~ age_grp, data=lung_cat)
print(cox_age_70)

## Call:
## coxph(formula = Surv(time, had_event) ~ age_grp, data = lung_cat)
## 
##              coef exp(coef) se(coef)     z      p
## age_grp70+ 0.3490    1.4177   0.1744 2.002 0.0453
## 
## Likelihood ratio test=3.8  on 1 df, p=0.0513
## n= 228, number of events= 165

print(paste0("cox_age AIC = ", round(aic(cox_age), 1)))

## [1] "cox_age AIC = 1497.6"

print(paste0("cox_age_70 AIC = ", round(aic(cox_age_70), 1)))

## [1] "cox_age_70 AIC = 1498"

cox.zph(cox_age_70)

##            chisq df p
## age_grp 1.04e-05  1 1
## GLOBAL  1.04e-05  1 1

The AIC is about the same as the model with a continuous variable. We test the proportional hazards assumption for age.

m = survfit(Surv(time, had_event) ~ age_grp, lung_cat)
s = summary(m)
s_table = data.frame(s$strata, s$time, s$n.risk, s$n.event, s$n.censor, s$surv, s$lower, s$upper)
s_table = s_table %>%
              rename(strata=s.strata, time=s.time, surv=s.surv, lower=s.lower, upper=s.upper) %>%
              mutate(negloglogsurv=-log(-log(surv)))

ggplot(s_table, aes(x=time, y=negloglogsurv, color=strata)) + 
  geom_line(size=1.25) +
  theme(text=element_text(size=16),
        plot.title=element_text(hjust=0.5)) +
  ylab("-log(-log(S(t)))") +
  ggtitle("-log-log plot of survival time by age")

We can also look at the Kaplain-Meier curve with fir with the same variable.

ggsurvplot(survfit(Surv(time, event=had_event) ~ age_grp, data=lung_cat), 
           xlab="Days", 
           ylab="Overall survival probability",
           pval = TRUE,
           risk.table=FALSE,
           conf.int=TRUE,
           surv.median.line="hv")

We can then compare the survival curves from KM (km_age) and Cox PH model (cox_age).

km_age = survfit(Surv(time, event=had_event) ~ age_grp, data=lung_cat)
km.curve <- getKMcurve(km = km_age, 
                       time.col = 'time', 
                       event.col = 'had_event', 
                       group.col = 'age_grp', 
                       data = lung_cat)
km.curve = km.curve %>% 
            select(group, Time, Survival) %>% 
            rename(age_grp=group, time=Time, surv=Survival) %>%
            mutate(age_grp=factor(age_grp, levels=c("39-69", "70+")))
            
cox.curve = data.frame(age_grp=factor(lung_cat$age_grp, levels=c("39-69", "70+")),
                       time=lung_cat$time,
                       surv=exp(-predict(cox_age, type="expected")))
ggplot() +
  geom_line(data=km.curve, aes(x=time, y=surv, color=age_grp), size=0.25) +
  geom_line(data=cox.curve, aes(x=time, y=surv, color=age_grp), size=1, linetype="dashed") +
  ggtitle("Kaplan-Meier vs. Cox Model") +
  ylab("Survival Probability") + 
  theme(title=element_text(hjust=.5),
        text=element_text(size=20))

Model selection with the cox PH model

1) Model with all variables

mod1 = coxph(Surv(time, had_event) ~ age * sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss, data=lung)
print(mod1)

## Call:
## coxph(formula = Surv(time, had_event) ~ age * sex + ph.ecog + 
##     ph.karno + pat.karno + meal.cal + wt.loss, data = lung)
## 
##                 coef  exp(coef)   se(coef)      z        p
## age        2.536e-02  1.026e+00  1.466e-02  1.730 0.083697
## sexF       1.941e+00  6.968e+00  1.429e+00  1.358 0.174356
## ph.ecog    7.936e-01  2.211e+00  2.226e-01  3.565 0.000363
## ph.karno   2.489e-02  1.025e+00  1.128e-02  2.206 0.027409
## pat.karno -1.508e-02  9.850e-01  8.159e-03 -1.848 0.064544
## meal.cal   3.436e-05  1.000e+00  2.567e-04  0.134 0.893538
## wt.loss   -1.624e-02  9.839e-01  7.948e-03 -2.044 0.040951
## age:sexF  -3.960e-02  9.612e-01  2.261e-02 -1.752 0.079834
## 
## Likelihood ratio test=31.38  on 8 df, p=0.0001203
## n= 168, number of events= 121 
##    (60 observations deleted due to missingness)

print(paste0("mod1 AIC = ", round(aic(mod1), 1)))

## [1] "mod1 AIC = 1010.5"

We observe a significant effect of sex:age, ph.ecog, ph.karno, wt.loss. The hazard ratio is 6.9x larger for women than men. The signifcant variables are ph.ecog, ph.karno, wt.loss.

We assess the proportional hazards assumption for our more complex model. Recall: if p < 0.05 (or other alpha), the assumption is violated.

cox.zph(mod1)

##              chisq df      p
## age       1.04e-04  1 0.9919
## sex       9.51e-01  1 0.3296
## ph.ecog   2.43e+00  1 0.1192
## ph.karno  5.22e+00  1 0.0223
## pat.karno 2.74e+00  1 0.0977
## meal.cal  6.66e+00  1 0.0098
## wt.loss   1.43e-01  1 0.7051
## age:sex   1.36e+00  1 0.2437
## GLOBAL    1.56e+01  8 0.0483

2) Model without meal.cal

We fit a model without meal.cal because it was the variable with the highest p-value in previous model and it violates the PH assumption.

mod2 = coxph(Surv(time, had_event) ~ sex*age + ph.ecog + ph.karno + pat.karno + wt.loss, data=lung)
print(mod2)

## Call:
## coxph(formula = Surv(time, had_event) ~ sex * age + ph.ecog + 
##     ph.karno + pat.karno + wt.loss, data = lung)
## 
##                coef exp(coef)  se(coef)      z        p
## sexF       1.678603  5.358066  1.276463  1.315 0.188496
## age        0.026232  1.026579  0.012483  2.101 0.035603
## ph.ecog    0.746184  2.108937  0.199890  3.733 0.000189
## ph.karno   0.022262  1.022511  0.010204  2.182 0.029133
## pat.karno -0.015944  0.984182  0.007255 -2.198 0.027979
## wt.loss   -0.014740  0.985369  0.007104 -2.075 0.037999
## sexF:age  -0.036613  0.964049  0.020199 -1.813 0.069889
## 
## Likelihood ratio test=40.44  on 7 df, p=1.038e-06
## n= 210, number of events= 148 
##    (18 observations deleted due to missingness)

print(paste0("mod2 AIC = ", round(aic(mod2), 1)))

## [1] "mod2 AIC = 1291.7"

All the variables are significant except sex. However, the AIC is much higher (worse) than the previous model including meal.cal.

3) Model with ECOG score, physician-rated Karnofsky score, and weight loss

We build a model with only ECOG score (ph.ecog), physician-rated Karnofsky score (ph.karno), and weight loss (wt.loss), which were the only significant variables in first model (mod1) with all variables.

mod3 = coxph(Surv(time, had_event) ~ ph.ecog + ph.karno + wt.loss, data=lung)
print(mod3)

## Call:
## coxph(formula = Surv(time, had_event) ~ ph.ecog + ph.karno + 
##     wt.loss, data = lung)
## 
##               coef exp(coef)  se(coef)      z        p
## ph.ecog   0.646090  1.908066  0.193993  3.330 0.000867
## ph.karno  0.008908  1.008948  0.010074  0.884 0.376519
## wt.loss  -0.007457  0.992571  0.006499 -1.147 0.251203
## 
## Likelihood ratio test=17.88  on 3 df, p=0.0004658
## n= 213, number of events= 151 
##    (15 observations deleted due to missingness)

print(paste0("mod3 AIC = ", round(aic(mod3), 1)))

## [1] "mod3 AIC = 1338.2"

The fit is significant, but only significant variable is ph.ecog. The AIC is much higher than the previous models as well.

We refit the model with complete cases with ECOG score, physician-rated Karnofsky score, and weight loss, sex and age and their interaction, to compare against mod1 (the model with all variables).

lung_subset = lung[complete.cases(lung[, c("sex", "age", "ph.ecog", "ph.karno", "pat.karno", "meal.cal", "wt.loss")]),]
mod3b = coxph(Surv(time, had_event) ~ ph.ecog + ph.karno + wt.loss, data=lung_subset)
print(mod3b)

## Call:
## coxph(formula = Surv(time, had_event) ~ ph.ecog + ph.karno + 
##     wt.loss, data = lung_subset)
## 
##               coef exp(coef)  se(coef)      z        p
## ph.ecog   0.779269  2.179879  0.219886  3.544 0.000394
## ph.karno  0.016661  1.016801  0.011261  1.480 0.139000
## wt.loss  -0.009295  0.990748  0.007310 -1.272 0.203509
## 
## Likelihood ratio test=16.33  on 3 df, p=0.0009724
## n= 168, number of events= 121

print(paste0("mod1 AIC = ", round(aic(mod1), 1)))

## [1] "mod1 AIC = 1010.5"

print(paste0("mod3b AIC = ", round(aic(mod3b), 1)))

## [1] "mod3b AIC = 1015.5"

anova(mod1, mod3b)

## Analysis of Deviance Table
##  Cox model: response is  Surv(time, had_event)
##  Model 1: ~ age * sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss
##  Model 2: ~ ph.ecog + ph.karno + wt.loss
##    loglik  Chisq Df P(>|Chi|)  
## 1 -497.23                      
## 2 -504.75 15.053  5   0.01014 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The comparison of mod1 and mod3b leads to the conclusion that mod1 fits the data significantly better than mod3b.

We fit the same model adding back the variables age, sex and their interaction.

mod3c = coxph(Surv(time, had_event) ~ age * sex + ph.ecog + ph.karno + wt.loss, data=lung)
print(mod3c)

## Call:
## coxph(formula = Surv(time, had_event) ~ age * sex + ph.ecog + 
##     ph.karno + wt.loss, data = lung)
## 
##               coef exp(coef)  se(coef)      z        p
## age       0.026663  1.027022  0.012238  2.179   0.0294
## sexF      1.426764  4.165199  1.267294  1.126   0.2602
## ph.ecog   0.803402  2.233126  0.193680  4.148 3.35e-05
## ph.karno  0.016717  1.016858  0.009775  1.710   0.0872
## wt.loss  -0.010327  0.989726  0.006768 -1.526   0.1271
## age:sexF -0.032677  0.967851  0.020026 -1.632   0.1027
## 
## Likelihood ratio test=36.16  on 6 df, p=2.572e-06
## n= 213, number of events= 151 
##    (15 observations deleted due to missingness)

print(paste0("mod1 AIC = ", round(aic(mod1), 1)))

## [1] "mod1 AIC = 1010.5"

print(paste0("mod3 AIC = ", round(aic(mod3), 1)))

## [1] "mod3 AIC = 1338.2"

print(paste0("mod3b AIC = ", round(aic(mod3b), 1)))

## [1] "mod3b AIC = 1015.5"

print(paste0("mod3c AIC = ", round(aic(mod3c), 1)))

## [1] "mod3c AIC = 1325.9"

The model is statistically significant with a p-value equal to 2.572e-06.

We refit the model with complete cases with age*sex, ECOG score, physician-rated Karnofsky score, and weight loss, sex and age and their interaction, to compare against mod1 (the model with all variables).

lung_subset = lung[complete.cases(lung[, c("sex", "age", "ph.ecog", "ph.karno", "pat.karno", "meal.cal", "wt.loss")]),]
mod3d = coxph(Surv(time, had_event) ~ age * sex + ph.ecog + ph.karno + wt.loss, data=lung_subset)
summary(mod3d)

## Call:
## coxph(formula = Surv(time, had_event) ~ age * sex + ph.ecog + 
##     ph.karno + wt.loss, data = lung_subset)
## 
##   n= 168, number of events= 121 
## 
##               coef exp(coef)  se(coef)      z Pr(>|z|)    
## age       0.023098  1.023367  0.014340  1.611   0.1072    
## sexF      1.424718  4.156684  1.398930  1.018   0.3085    
## ph.ecog   0.883761  2.419983  0.217135  4.070  4.7e-05 ***
## ph.karno  0.021429  1.021661  0.011041  1.941   0.0523 .  
## wt.loss  -0.012775  0.987307  0.007695 -1.660   0.0969 .  
## age:sexF -0.031549  0.968944  0.022129 -1.426   0.1540    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##          exp(coef) exp(-coef) lower .95 upper .95
## age         1.0234     0.9772    0.9950     1.053
## sexF        4.1567     0.2406    0.2679    64.494
## ph.ecog     2.4200     0.4132    1.5812     3.704
## ph.karno    1.0217     0.9788    0.9998     1.044
## wt.loss     0.9873     1.0129    0.9725     1.002
## age:sexF    0.9689     1.0321    0.9278     1.012
## 
## Concordance= 0.649  (se = 0.029 )
## Likelihood ratio test= 27.97  on 6 df,   p=1e-04
## Wald test            = 28.32  on 6 df,   p=8e-05
## Score (logrank) test = 28.74  on 6 df,   p=7e-05

print(paste0("mod1 AIC = ", round(aic(mod1), 1)))

## [1] "mod1 AIC = 1010.5"

print(paste0("mod3d AIC = ", round(aic(mod3d), 1)))

## [1] "mod3d AIC = 1009.9"

anova(mod1, mod3d)

## Analysis of Deviance Table
##  Cox model: response is  Surv(time, had_event)
##  Model 1: ~ age * sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss
##  Model 2: ~ age * sex + ph.ecog + ph.karno + wt.loss
##    loglik  Chisq Df P(>|Chi|)
## 1 -497.23                    
## 2 -498.93 3.4046  2    0.1823

The comparison of mod1 and mod3d leads to the conclusion that mod3d fits as well as the model with all the variables. Here we dropped meal.cal and pat.karno. The model mod3d is the best so far.

4) Model without ph.karno

We fit a model without ph.karno because it is highly correlated with ph.ecog and with pat.karno. We include all the variables except ph.karno.

mod4 = coxph(Surv(time, had_event) ~ age * sex + ph.ecog + pat.karno + meal.cal + wt.loss, data=lung)
print(mod4)

## Call:
## coxph(formula = Surv(time, had_event) ~ age * sex + ph.ecog + 
##     pat.karno + meal.cal + wt.loss, data = lung)
## 
##                 coef  exp(coef)   se(coef)      z       p
## age        1.659e-02  1.017e+00  1.389e-02  1.195 0.23228
## sexF       1.522e+00  4.580e+00  1.424e+00  1.069 0.28509
## ph.ecog    4.637e-01  1.590e+00  1.688e-01  2.747 0.00602
## pat.karno -1.207e-02  9.880e-01  8.116e-03 -1.487 0.13702
## meal.cal   2.244e-05  1.000e+00  2.499e-04  0.090 0.92847
## wt.loss   -1.494e-02  9.852e-01  7.834e-03 -1.907 0.05658
## age:sexF  -3.270e-02  9.678e-01  2.252e-02 -1.452 0.14660
## 
## Likelihood ratio test=26.21  on 7 df, p=0.0004619
## n= 168, number of events= 121 
##    (60 observations deleted due to missingness)

print(paste0("mod1 AIC = ", round(aic(mod1), 1)))

## [1] "mod1 AIC = 1010.5"

print(paste0("mod3d AIC = ", round(aic(mod3d), 1)))

## [1] "mod3d AIC = 1009.9"

print(paste0("mod4 AIC = ", round(aic(mod4), 1)))

## [1] "mod4 AIC = 1013.6"

anova(mod1, mod4)

## Analysis of Deviance Table
##  Cox model: response is  Surv(time, had_event)
##  Model 1: ~ age * sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss
##  Model 2: ~ age * sex + ph.ecog + pat.karno + meal.cal + wt.loss
##    loglik  Chisq Df P(>|Chi|)  
## 1 -497.23                      
## 2 -499.81 5.1676  1   0.02301 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model is statistically sginificant and the AIC is close to the AICs of model 1 and model 3d. However, the anova suggests to keep ph.karno.

We fit the same model without meal.cal.

mod4b = coxph(Surv(time, had_event) ~ age * sex + ph.ecog + pat.karno + wt.loss, data=lung)
print(mod4b)

## Call:
## coxph(formula = Surv(time, had_event) ~ age * sex + ph.ecog + 
##     pat.karno + wt.loss, data = lung)
## 
##                coef exp(coef)  se(coef)      z       p
## age        0.021113  1.021338  0.011998  1.760 0.07845
## sexF       1.370642  3.937877  1.277385  1.073 0.28327
## ph.ecog    0.443006  1.557381  0.145867  3.037 0.00239
## pat.karno -0.013046  0.987038  0.007210 -1.810 0.07037
## wt.loss   -0.013208  0.986879  0.006965 -1.896 0.05793
## age:sexF  -0.030944  0.969530  0.020192 -1.532 0.12541
## 
## Likelihood ratio test=35.43  on 6 df, p=3.559e-06
## n= 210, number of events= 148 
##    (18 observations deleted due to missingness)

print(paste0("mod1 AIC = ", round(aic(mod1), 1)))

## [1] "mod1 AIC = 1010.5"

print(paste0("mod3d AIC = ", round(aic(mod3d), 1)))

## [1] "mod3d AIC = 1009.9"

print(paste0("mod4 AIC = ", round(aic(mod4), 1)))

## [1] "mod4 AIC = 1013.6"

print(paste0("mod4b AIC = ", round(aic(mod4b), 1)))

## [1] "mod4b AIC = 1294.7"

The AIC is much higher (worse) suggesting to keep meal.cal.

We refit the model with complete cases with ECOG score, patient-rated Karnofsky score, and weight loss, sex and age and their interaction, to compare against mod1 (the model with all variables).

lung_subset = lung[complete.cases(lung[, c("sex", "age", "ph.ecog", "ph.karno", "pat.karno", "meal.cal", "wt.loss")]),]
mod4c = coxph(Surv(time, had_event) ~ age * sex + ph.ecog + pat.karno + wt.loss, data=lung_subset)
print(mod4c)

## Call:
## coxph(formula = Surv(time, had_event) ~ age * sex + ph.ecog + 
##     pat.karno + wt.loss, data = lung_subset)
## 
##                coef exp(coef)  se(coef)      z       p
## age        0.016396  1.016531  0.013719  1.195 0.23202
## sexF       1.517689  4.561672  1.422603  1.067 0.28604
## ph.ecog    0.463542  1.589694  0.168848  2.745 0.00605
## pat.karno -0.011957  0.988115  0.008017 -1.491 0.13585
## wt.loss   -0.014940  0.985171  0.007847 -1.904 0.05691
## age:sexF  -0.032672  0.967856  0.022519 -1.451 0.14681
## 
## Likelihood ratio test=26.2  on 6 df, p=0.0002041
## n= 168, number of events= 121

print(paste0("mod1 AIC = ", round(aic(mod1), 1)))

## [1] "mod1 AIC = 1010.5"

print(paste0("mod4c AIC = ", round(aic(mod4c), 1)))

## [1] "mod4c AIC = 1011.6"

anova(mod1, mod4c)

## Analysis of Deviance Table
##  Cox model: response is  Surv(time, had_event)
##  Model 1: ~ age * sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss
##  Model 2: ~ age * sex + ph.ecog + pat.karno + wt.loss
##    loglik  Chisq Df P(>|Chi|)  
## 1 -497.23                      
## 2 -499.81 5.1756  2   0.07518 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The AICs are equivalent meaning that this model is as significant as the first model with all the variables. Similarly, the anova has a high p-value indicating that the two models are significant.

We therefore compare the model with ECOG score, patient-rated Karnofsky score, and weight loss, sex and age and their interaction vs ECOG score, physician-rated Karnofsky score, and weight loss, sex and age and their interaction.

print(paste0("mod3d AIC = ", round(aic(mod3d), 1)))

## [1] "mod3d AIC = 1009.9"

print(paste0("mod4c AIC = ", round(aic(mod4c), 1)))

## [1] "mod4c AIC = 1011.6"

The AICs are equivalent (<2) therefore any of the two models (mod3d or mod4c) can be used as the best model so far.

5) Model with ph.ecog modeled as categorical

We fit a model with ph.ecog modeled as categorical.

lung_cat = lung %>% mutate(ph.ecog = factor(ph.ecog))
mod5 = coxph(Surv(time, had_event) ~ age * sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss, data=lung_cat)
print(mod5)

## Call:
## coxph(formula = Surv(time, had_event) ~ age * sex + ph.ecog + 
##     ph.karno + pat.karno + meal.cal + wt.loss, data = lung_cat)
## 
##                coef exp(coef)  se(coef)      z        p
## age        0.024468  1.024769  0.014760  1.658 0.097382
## sexF       1.933671  6.914847  1.430977  1.351 0.176601
## ph.ecog1   0.692465  1.998636  0.282483  2.451 0.014232
## ph.ecog2   1.570924  4.811092  0.456257  3.443 0.000575
## ph.ecog3   2.803670 16.505112  1.122027  2.499 0.012463
## ph.karno   0.024520  1.024823  0.011381  2.154 0.031206
## pat.karno -0.014263  0.985839  0.008558 -1.667 0.095587
## meal.cal   0.000034  1.000034  0.000258  0.132 0.895156
## wt.loss   -0.016448  0.983686  0.008025 -2.050 0.040411
## age:sexF  -0.039428  0.961339  0.022652 -1.741 0.081750
## 
## Likelihood ratio test=31.8  on 10 df, p=0.0004323
## n= 168, number of events= 121 
##    (60 observations deleted due to missingness)

print(paste0("mod1 AIC = ", round(aic(mod1), 1)))

## [1] "mod1 AIC = 1010.5"

print(paste0("mod3d AIC = ", round(aic(mod3d), 1)))

## [1] "mod3d AIC = 1009.9"

print(paste0("mod4c AIC = ", round(aic(mod4c), 1)))

## [1] "mod4c AIC = 1011.6"

print(paste0("mod5 AIC = ", round(aic(mod5), 1)))

## [1] "mod5 AIC = 1014"

The fit of the model is significant. However, the AIC is slightly higher.

We fit a model with meal.cal and ph.karno dropped and with ph.ecog modeled as categorical.

mod5b = coxph(Surv(time, had_event) ~ age * sex + ph.ecog + pat.karno + wt.loss, data=lung_cat)
print(mod5b)

## Call:
## coxph(formula = Surv(time, had_event) ~ age * sex + ph.ecog + 
##     pat.karno + wt.loss, data = lung_cat)
## 
##                coef exp(coef)  se(coef)      z       p
## age        0.020624  1.020839  0.012018  1.716 0.08614
## sexF       1.290859  3.635908  1.290049  1.001 0.31701
## ph.ecog1   0.456363  1.578324  0.211800  2.155 0.03119
## ph.ecog2   0.839574  2.315380  0.301590  2.784 0.00537
## ph.ecog3   2.036356  7.662635  1.043126  1.952 0.05092
## pat.karno -0.013848  0.986248  0.007497 -1.847 0.06475
## wt.loss   -0.013170  0.986916  0.006977 -1.888 0.05907
## age:sexF  -0.029600  0.970834  0.020397 -1.451 0.14673
## 
## Likelihood ratio test=35.89  on 8 df, p=1.838e-05
## n= 210, number of events= 148 
##    (18 observations deleted due to missingness)

print(paste0("mod1 AIC = ", round(aic(mod1), 1)))

## [1] "mod1 AIC = 1010.5"

print(paste0("mod3d AIC = ", round(aic(mod3d), 1)))

## [1] "mod3d AIC = 1009.9"

print(paste0("mod4c AIC = ", round(aic(mod4c), 1)))

## [1] "mod4c AIC = 1011.6"

print(paste0("mod5 AIC = ", round(aic(mod5), 1)))

## [1] "mod5 AIC = 1014"

print(paste0("mod5b AIC = ", round(aic(mod5b), 1)))

## [1] "mod5b AIC = 1298.3"

The fit of the model is still significant but the AIC is much higher.

We look at the Kaplan-Meier plot of ph.ecog when it is categorical.

# visualize (KM plot) effect of ph.ecog
ggsurvplot(survfit(Surv(time, event=had_event) ~ ph.ecog, data=lung_cat), 
           xlab="Days", 
           ylab="Overall survival probability",
           pval = TRUE,
           risk.table=FALSE,
           conf.int=TRUE,
           surv.median.line="hv")

We compute the log-rank to test the effect of ph.ecog on survival time.

survdiff(Surv(time, had_event) ~ ph.ecog, lung_cat)

## Call:
## survdiff(formula = Surv(time, had_event) ~ ph.ecog, data = lung_cat)
## 
## n=227, 1 observation deleted due to missingness.
## 
##             N Observed Expected (O-E)^2/E (O-E)^2/V
## ph.ecog=0  63       37   54.153    5.4331    8.2119
## ph.ecog=1 113       82   83.528    0.0279    0.0573
## ph.ecog=2  50       44   26.147   12.1893   14.6491
## ph.ecog=3   1        1    0.172    3.9733    4.0040
## 
##  Chisq= 22  on 3 degrees of freedom, p= 7e-05

The test suggests that the ECOG score is statistically significant.

We test the proportional hazards assumption for ph.ecog.

m = survfit(Surv(time, had_event) ~ ph.ecog, lung_cat)
s = summary(m)
s_table = data.frame(s$strata, s$time, s$n.risk, s$n.event, s$n.censor, s$surv, s$lower, s$upper)
s_table = s_table %>%
              rename(strata=s.strata, time=s.time, surv=s.surv, lower=s.lower, upper=s.upper) %>%
              mutate(negloglogsurv=-log(-log(surv)))

ggplot(s_table, aes(x=time, y=negloglogsurv, color=strata)) + 
  geom_line(size=1.25) +
  theme(text=element_text(size=16),
        plot.title=element_text(hjust=0.5)) +
  ylab("-ln(-ln(S(t)))") +
  ggtitle("-log-log plot of survival time by ECOG score")

The curves cross, so the PH assumption is violated !

We can also test with the Goodness of Fit (GOF) Testing Approach based on the residuals defined by Schoenfeld.

We fit a model with only the categorical ph.ecog.

# statistical test for violation of PH assumption
mod5c = coxph(Surv(time, had_event) ~ ph.ecog, data=lung_cat)
print(mod5c)

## Call:
## coxph(formula = Surv(time, had_event) ~ ph.ecog, data = lung_cat)
## 
##            coef exp(coef) se(coef)     z        p
## ph.ecog1 0.3688    1.4461   0.1987 1.857   0.0634
## ph.ecog2 0.9164    2.5002   0.2245 4.081 4.48e-05
## ph.ecog3 2.2080    9.0973   1.0258 2.152   0.0314
## 
## Likelihood ratio test=18.44  on 3 df, p=0.0003561
## n= 227, number of events= 164 
##    (1 observation deleted due to missingness)

print(paste0("mod5c AIC = ", round(aic(mod5c), 1)))

## [1] "mod5c AIC = 1476.5"

test.ph <- cox.zph(mod5c)
print(test.ph)

##         chisq df    p
## ph.ecog  4.88  3 0.18
## GLOBAL   4.88  3 0.18

ggcoxzph(test.ph)

Surprisingly, this test indicates that the PH assumption is not violated.

As a remedy, we carry out a stratified Cox (SC) procedure for the analysis.

The “stratified Cox model” is a modification of the Cox proportional hazards (PH) model that allows for control by “stratification” of a predictor that does not satisfy the PH assumption.

The general SC model is:

\[h_{g}(t, X) = h_{0g}(t)exp(\beta_{1}X_{i,1} + ... +\ beta_{p}X_{i,p})\]

for g = 0, 1, 2 and 3 (the different values of the ECOG scores).

In this model, the baseline hazard function \(h_{0g}(t)\) is allowed to be different for each stratum. However, the coefficients \(b_1\), \(b_2\),…, \(b_p\) are the same for each stratum. It is called the No-Interaction Assumption because the \(\beta\)’s in the model are the same for each subscript \(g\). The no- interaction assumption means that the variables being stratified are assumed not to interact with the X’s in the model.

Using SC, we can control for the ph.ecog variable – which does not satisfy the PH assumption – by stratification while simultaneously including in the model the other variables – which do satisfy the PH assumption.

mod5d = coxph(Surv(time, had_event) ~ strata(ph.ecog) + sex*age + pat.karno + wt.loss, data=lung_cat)
print(mod5d)

## Call:
## coxph(formula = Surv(time, had_event) ~ strata(ph.ecog) + sex * 
##     age + pat.karno + wt.loss, data = lung_cat)
## 
##                coef exp(coef)  se(coef)      z      p
## sexF       1.049484  2.856176  1.289971  0.814 0.4159
## age        0.019087  1.019270  0.011976  1.594 0.1110
## pat.karno -0.014319  0.985783  0.007478 -1.915 0.0555
## wt.loss   -0.011159  0.988903  0.006940 -1.608 0.1078
## sexF:age  -0.025847  0.974484  0.020414 -1.266 0.2055
## 
## Likelihood ratio test=18.48  on 5 df, p=0.002397
## n= 210, number of events= 148 
##    (18 observations deleted due to missingness)

Note, that the results of the test do not include the variable ph.ecog anymore.

print(paste0("mod1 AIC = ", round(aic(mod1), 1)))

## [1] "mod1 AIC = 1010.5"

print(paste0("mod3d AIC = ", round(aic(mod3d), 1)))

## [1] "mod3d AIC = 1009.9"

print(paste0("mod4c AIC = ", round(aic(mod4c), 1)))

## [1] "mod4c AIC = 1011.6"

print(paste0("mod5 AIC = ", round(aic(mod5), 1)))

## [1] "mod5 AIC = 1014"

print(paste0("mod5b AIC = ", round(aic(mod5b), 1)))

## [1] "mod5b AIC = 1298.3"

print(paste0("mod5d AIC = ", round(aic(mod5d), 1)))

## [1] "mod5d AIC = 991.8"

The AIC of the last model is consequently lower (better). The model 6 is the bets so far.

ggadjustedcurves(mod5d, data=(lung_cat %>% filter(ph.ecog != 3)))

## The variable argument is missing. Using ph.ecog as extracted from strata

6) Model with age modeled as categorical

lung_cat = lung_cat %>% 
  mutate(age_grp = cut(lung$age, breaks=c(-Inf, 69, Inf), labels=c("39-69", "70+")))

mod6 = coxph(Surv(time, had_event) ~ strata(ph.ecog) + sex*age_grp + pat.karno + wt.loss, data=lung_cat)
print(mod6)

## Call:
## coxph(formula = Surv(time, had_event) ~ strata(ph.ecog) + sex * 
##     age_grp + pat.karno + wt.loss, data = lung_cat)
## 
##                      coef exp(coef)  se(coef)      z      p
## sexF            -0.458800  0.632042  0.212445 -2.160 0.0308
## age_grp70+       0.325746  1.385063  0.227563  1.431 0.1523
## pat.karno       -0.014171  0.985929  0.007537 -1.880 0.0601
## wt.loss         -0.009306  0.990737  0.006827 -1.363 0.1729
## sexF:age_grp70+ -0.418763  0.657860  0.433796 -0.965 0.3344
## 
## Likelihood ratio test=17.73  on 5 df, p=0.003299
## n= 210, number of events= 148 
##    (18 observations deleted due to missingness)

print(paste0("mod1 AIC = ", round(aic(mod1), 1)))

## [1] "mod1 AIC = 1010.5"

print(paste0("mod3d AIC = ", round(aic(mod3d), 1)))

## [1] "mod3d AIC = 1009.9"

print(paste0("mod4c AIC = ", round(aic(mod4c), 1)))

## [1] "mod4c AIC = 1011.6"

print(paste0("mod5 AIC = ", round(aic(mod5), 1)))

## [1] "mod5 AIC = 1014"

print(paste0("mod5b AIC = ", round(aic(mod5b), 1)))

## [1] "mod5b AIC = 1298.3"

print(paste0("mod5d AIC = ", round(aic(mod5d), 1)))

## [1] "mod5d AIC = 991.8"

print(paste0("mod6 AIC = ", round(aic(mod6), 1)))

## [1] "mod6 AIC = 992.5"

The model is also significant and the AIC is equivalent with the previous model. We keep the model 6 as the final model.

Model selection with glmulti

We perform the best subset selection via glmulti:

lung2 <- lung %>% mutate(sex_age = as.numeric(sex)*age)
models <- glmulti(Surv(time, had_event) ~ sex + age + sex_age + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss, 
          data = lung2,
          level = 1,               # No interaction considered
          method = "h",            # Exhaustive approach
          crit = "aic",            # AIC as criteria
          confsetsize = 5,         # Keep 5 best models
          plotty = F, report = F,  # No plot or interim reports
          fitfunction = "coxph")   # coxph function

models@formulas

## [[1]]
## Surv(time, had_event) ~ 1 + age + sex_age + ph.ecog + ph.karno + 
##     pat.karno + meal.cal + wt.loss
## <environment: 0x00000000676f36b8>
## 
## [[2]]
## Surv(time, had_event) ~ 1 + sex + ph.ecog + ph.karno + pat.karno + 
##     meal.cal + wt.loss
## <environment: 0x00000000676f36b8>
## 
## [[3]]
## Surv(time, had_event) ~ 1 + sex + age + sex_age + ph.ecog + ph.karno + 
##     pat.karno + meal.cal + wt.loss
## <environment: 0x00000000676f36b8>
## 
## [[4]]
## Surv(time, had_event) ~ 1 + sex + age + ph.ecog + ph.karno + 
##     pat.karno + meal.cal + wt.loss
## <environment: 0x00000000676f36b8>
## 
## [[5]]
## Surv(time, had_event) ~ 1 + sex + ph.ecog + pat.karno + meal.cal + 
##     wt.loss
## <environment: 0x00000000676f36b8>

# slightly lower (better) AIC than above
AIC(models@objects[[1]])

## [1] 1010.277

AIC(models@objects[[2]])

## [1] 1010.358

AIC(models@objects[[3]])

## [1] 1010.453

AIC(models@objects[[4]])

## [1] 1011.504

AIC(models@objects[[5]])

## [1] 1011.952

With this method, the best model has all variables!