Pseudo-observations

Roemer J. Janse

2023-10-31

Survival analysis

When do we use survival analysis?

  • Time to event data
  • Multiple hurdles in data
    • Censoring
    • Truncation
    • Competing events

NCCTG Lung Cancer Data

Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group.

head(lung)
# A tibble: 6 × 12
     id  inst  time status   age   sex ph.ecog ph.karno pat.karno meal.cal
  <int> <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>    <dbl>     <dbl>    <dbl>
1     1     3   306      2    74     0       1       90       100     1175
2     2     3   455      2    68     0       0       90        90     1225
3     3     3  1010      1    56     0       0       90        90       NA
4     4     5   210      2    57     0       1       90        60     1150
5     5     1   883      2    60     0       0      100        90       NA
6     6    12  1022      1    74     0       1       50        80      513
# ℹ 2 more variables: wt.loss <dbl>, months <dbl>

Status: 1 = censored, 2 = dead

Studying survival

A common method is to use a Kaplan-Meier curve to compare survival in the presence of censoring.

# Plot survival
lung %>%
    # Fit survival model
    survfit(Surv(months, status) ~ sex, data = .) %>%
    # Change to data frame
    tidy() %>%
    # Plot data
    ggplot(aes(x = time, y = estimate, ymin = conf.low, ymax = conf.high, colour = strata, fill = strata)) +
        # Geometries
        geom_step() +
        geom_stepribbon(alpha = 0.25, colour = NA) +
        # Scaling
        scale_x_continuous(limits = c(0, 36), expand = expansion(add = 0.1), 
                           breaks = seq(0, 36, 3), labels = seq(0, 36, 3), name = "Time (months)") +
        scale_y_continuous(limits = c(0, 1), expand = c(0, 0), 
                           breaks = seq(0, 1, 0.1), labels = seq(0, 1, 0.1), name = "Proportion alive") +
        scale_colour_manual(values = c("darkred", "darkorange"), labels = c("Men", "Women")) +
        scale_fill_manual(values = c("darkred", "darkorange"), labels = c("Men", "Women")) +
        # Aesthetics
        theme_custom()

Pseudo-observations

Andersen PK, Perme MP. Stat Methods Med Res. 2010

General idea

Mathematical

Written

Per person, we observe a survival time \(T\) for each individual \(i\).

Every person survives for a certain amount of time.

If for each individual \(T\) is complete, we could apply a linear model to \(T\), or apply binary methods by treating it as \(I(T \le t)\) for any reasonable \(t\).

If we know the survival time of each person, we could use that as a continuous outcome or binary outcome after dichotomizing.

We could set up models for any function, \(f(T)\).

We could use time as outcome in any analysis.

Then, we could estimate the expected value \(E(f(T))\) as \(1 / n \sum_i{f(T_i)}\).

We could then calculate the expected value of time by taking the mean of each individual’s time.

What if we do not observe \(T\) for each individual \(i\)?

What if we do not know the survival time of each person?

In that case, we can substitute \(f(T)\) with pseudo-observations.

In that case, we can use pseudo-observations instead of time.

Calculating pseudo-observations

Mathematical

Written

Suppose we do not observe all \(T_i\), but we have a well-behaved estimator \(\hat{\theta}\) to estimate \(E(f(T))\).

What if we do not know the survival time of each individual, but we do have some kind of summary measure that could calculate the expected value of time.

This could be the Kaplan-Meier estimate \(S(t) = E(I(T > t))\).

The summary measure could simply be the Kaplan-Meier estimate for a certain point in time.

We can then estimate \(\hat{\theta}^{-i}\) for each individual \(i\), \(i = 1, ..., n\), where \(\hat{\theta}^{-i}\) is the estimator applied to a data set of size \(n\) - 1 obtained by excluding the \(i^{th}\) individual.

We can calculate a specific summary measure for each person, by calculating the summary measure on the whole population minus that person, for each person.

Using \(\hat{\theta}^{-i}\), we can estimate the pseudo-observation \(\hat{\theta_i}\) for each \(i\) as \(\hat{\theta_i} = n * \hat{\theta} - (n - 1) * \hat{\theta}^{-i}\).

Then per person we can calculate their pseudo-observation based on the difference between the total summary measure and the summary measure if they are excluded.

We can now replace the incomplete observed \(f(T_i)\) by \(\hat{\theta_i}\). \(\hat{\theta_i}\) can then be used as an outcome variable. The pseudo-osbervation \(\hat{\theta_i}\) will always be used for all subjects, even if \(f(T_i)\) was observed.

Instead of using time as an outcome, we can now use the pseudo-observations as outcome. If we use pseudo-observations, we use them for every individual, regardless whether we observed time for them.

Intuitively, the \(i^{th}\) pseudo-observation is the contribution of individual \(i\) to the estimate of \(E(f(T))\).

Intuitively, each person’s pseudo-observation is how much they contribute to the overall estimate on the population.

Characteristics of the pseudo-observation

Now let’s calculate some pseudo-observations using the NCCTG Lung Cancer Data for a few individuals.

# Function to get survival estimates
surv_est <- function(data, strat_var = 1){
    # Get estimate for each timepoint
    return(survfit(as.formula(paste0("Surv(time, status) ~ ", strat_var)), data = data) %>%
               # Create life table
               tidy() %>%
               # Keep only relevant variables
               select(time, estimate, any_of("strata")))
}

# Function to join data
pseudo_join <- function(data_ind, join_by = "time"){
    # Join data and return
    return(data_ind %>%
        # Join overall data based on time
        left_join(overall, join_by) %>%
        # Calculate pseudo-observation for each person at each time
        mutate(po = n * est_overall - (n - 1) * estimate) %>%
        # Keep only relevant variables
        select(id, time, po, any_of("strata")) %>%
        # Change time to months
        mutate(time = time / (365.25 / 12)))
}

# Get overall estimate for each timepoint
overall <- surv_est(lung) %>%
    # Rename variables
    rename(est_overall = estimate)

# Get estimate at each timepoint without person 2
individual <- surv_est(filter(lung, id != 2)) %>%
    # Create id variable
    mutate(id = 2)

# Add data together
dat_pseudo <- pseudo_join(individual)

# Individual survival curve
single_surv <- tibble(time = c(0, filter(lung, id == 2)[["time"]] / (365.25 / 12), max(dat_pseudo[["time"]])),
                      estimate = c(1, 0, 0)) %>%
    # Create plot
    ggplot(aes(x = time, y = estimate)) +
        # Geometries
        geom_step(colour = "darkred") +
        # Scaling
        scale_x_continuous(limits = c(0, 36), expand = expansion(add = 0.1), 
                           breaks = seq(0, 36, 3), labels = seq(0, 36, 3), name = "Time (months)") +
        scale_y_continuous(limits = c(-0.8, 1.6), expand = c(0, 0), round(seq(-0.8, 1.6, 0.2), 1),
                           name = expression("Survival estimate " * hat("S(t)"))) +
        # Aesthetics
        theme_custom() 

# Plot individual pseudo curve
single_pseudo <- ggplot(dat_pseudo, aes(x = time, y = po)) +
    # Geometries
    geom_step(colour = "darkred") +
    # Scaling
    scale_x_continuous(limits = c(0, 36), expand = expansion(add = 0.1), 
                       breaks = seq(0, 36, 3), labels = seq(0, 36, 3), name = "Time (months)") +
    scale_y_continuous(limits = c(-0.8, 1.6), expand = c(0, 0), round(seq(-0.8, 1.6, 0.2), 1),
                       name = expression("Pseudo-value " * hat(theta[i]))) +
    # Aesthetics
    theme_custom() 

# Get estimates for four different people
individual <- lapply(c(174, 2, 104, 214), \(x) surv_est(filter(lung, id != x)) %>% mutate(id = x)) %>%
    # Bind data together
    bind_rows()

# Add data together
dat_pseudo <- pseudo_join(individual)

# Plot individual pseudo curves
single_pseudos <- ggplot(dat_pseudo, aes(x = time, y = po, colour = factor(id), fill = factor(id))) +
    # Geometries
    geom_step() +
    # Scaling
    scale_x_continuous(limits = c(0, 36), expand = expansion(add = 0.1), 
                       breaks = seq(0, 36, 3), labels = seq(0, 36, 3), name = "Time (months)") +
    scale_y_continuous(limits = c(-0.8, 1.6), expand = c(0, 0), round(seq(-0.8, 1.6, 0.2), 1),
                       name = expression("Pseudo-value " * hat(theta[i]))) +
    scale_colour_manual(values = c("darkred", "darkorange", "darkblue", "darkgreen"),
                        labels = c("Died", "Died", "Censored", "Censored")) +
    scale_fill_manual(values = c("darkred", "darkorange", "darkblue", "darkgreen"),
                      labels = c("Died", "Died", "Censored", "Censored")) +
    # Aesthetics
    theme_custom() 
    

Relation to Kaplan-Meier

Individual pseudo-observations are difficult to interpret, but if we average them, they are almost exactly the KM-curve.

# Normal KM curve stratified by sex
lung %>%
    # Fit survival model
    survfit(Surv(time, status) ~ sex, data = .) %>%
    # Change to data frame
    tidy() %>%
    # Change time to months
    mutate(time = time / (365.25 / 12)) %>%
    # Plot data
    ggplot(aes(x = time, y = estimate, colour = strata, fill = strata)) +
        # Geometries
        geom_step() +
        # Scaling
        scale_x_continuous(limits = c(0, 36), expand = expansion(add = 0.1), 
                           breaks = seq(0, 36, 3), labels = seq(0, 36, 3), name = "Time (months)") +
        scale_y_continuous(limits = c(0, 1), expand = c(0, 0), breaks = seq(0, 1, 0.1),
                           labels = seq(0, 1, 0.1), name = "Proportion alive") +
        scale_colour_manual(values = c("darkred", "darkorange"), labels = c("Men", "Women")) +
        scale_fill_manual(values = c("darkred", "darkorange"), labels = c("Men", "Women")) +
        # Aesthetics
        theme_custom()

# Overall estimates
overall <- surv_est(lung, "sex") %>%
    # Rename variable
    rename(est_overall = estimate)

# Estimates with individuals eliminated
individual <- lapply(lung[["id"]], \(x) surv_est(filter(lung, id != x), "sex") %>% mutate(id = x)) %>%
    # Bind data together
    bind_rows()

# Add data together
dat_pseudo <- pseudo_join(individual, c("time", "strata")) %>%
    # Arrange for grouping
    arrange(strata, time) %>%
    # Group per time in each stratum
    group_by(strata, time) %>%
    # Get mean estimate
    summarise(po = mean(po), .groups = "drop") 

# Plot data
ggplot(dat_pseudo, aes(x = time, y = po, colour = strata, fill = strata)) +
    # Geometries
    geom_step() +
    # Scaling
    scale_x_continuous(limits = c(0, 36), expand = expansion(add = 0.1), 
                       breaks = seq(0, 36, 3), labels = seq(0, 36, 3), name = "Time (months)") +
    scale_y_continuous(limits = c(0, 1), expand = c(0, 0), breaks = seq(0, 1, 0.1),
                       labels = seq(0, 1, 0.1), name = "Proportion alive") +
    scale_colour_manual(values = c("darkred", "darkorange"), labels = c("Men", "Women")) +
    scale_fill_manual(values = c("darkred", "darkorange"), labels = c("Men", "Women")) +
    # Aesthetics
    theme_custom()

Practical applications

Pohar Perme M. Simplifying survival analysis with pseudo-observations. Presented at: Advanced Survival Analysis Course; September 4, Leiden.

Survival at time \(t\)

We can use pseudo-observations to determine survival at a certain moment in time \(t\) using simple linear regression.

# Get data at 12 months
dat_12m <- filter(dat_pseudo, month <= 12) %>%
    # Arrange for grouping
    arrange(id, month) %>%
    # Group per person
    group_by(id) %>%
    # Keep only last observation
    slice_tail() %>%
    # Ungroup again
    ungroup()

# Fit linear model at t = 12
lm(pseudo_observation ~ sex, data = dat_12m) %>%
    # Get model summary
    summary()

Call:
lm(formula = pseudo_observation ~ sex, data = dat_12m)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8164 -0.4372 -0.3347  0.6018  0.7908 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.33466    0.04565   7.331 4.03e-12 ***
sex          0.18894    0.07266   2.600  0.00992 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5363 on 226 degrees of freedom
Multiple R-squared:  0.02905,   Adjusted R-squared:  0.02476 
F-statistic: 6.762 on 1 and 226 DF,  p-value: 0.009924

Survival at time \(t\)

We can use pseudo-observations to determine survival at a certain moment in time \(t\) using simple linear regression.

# Get data at 18 months
dat_18m <- filter(dat_pseudo, month <= 18) %>%
    # Arrange for grouping
    arrange(id, month) %>%
    # Group per person
    group_by(id) %>%
    # Keep only last observation
    slice_tail() %>%
    # Ungroup again
    ungroup()

# Fit linear model at t = 18
lm(pseudo_observation ~ sex, data = dat_18m) %>%
    # Get model summary
    summary()

Call:
lm(formula = pseudo_observation ~ sex, data = dat_18m)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7739 -0.3613 -0.1894  0.2093  1.1292 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.18642    0.04398   4.239 3.28e-05 ***
sex          0.17488    0.07000   2.498   0.0132 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5167 on 226 degrees of freedom
Multiple R-squared:  0.02687,   Adjusted R-squared:  0.02257 
F-statistic: 6.241 on 1 and 226 DF,  p-value: 0.01319

Generalized estimating equations

Using generalized estimating equations (GEEs), we can use pseudo-observations as an outcome with different link functions for different modelling approaches:

  • Linear model: \(y\)
  • Logistic model: \(log(\frac{y}{1 - y})\)
  • Cox model: \(log(-log(S(t)))\)
  • Restricted mean survival time: \(E(T \wedge \tau | Z)\) or \(log(E(T \wedge \tau | Z))\)
  • Fine-Gray model: \(log(-log(1 - C_1(t)))\)

Normally, we could run a Cox model:

# Fit Cox model for hazard ratio
coxph(Surv(time, status) ~ sex, data = lung) %>%
    # Get model summary
    summary()
Call:
coxph(formula = Surv(time, status) ~ sex, data = lung)

  n= 228, number of events= 165 

       coef exp(coef) se(coef)      z Pr(>|z|)   
sex -0.5310    0.5880   0.1672 -3.176  0.00149 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    exp(coef) exp(-coef) lower .95 upper .95
sex     0.588      1.701    0.4237     0.816

Concordance= 0.579  (se = 0.021 )
Likelihood ratio test= 10.63  on 1 df,   p=0.001
Wald test            = 10.09  on 1 df,   p=0.001
Score (logrank) test = 10.33  on 1 df,   p=0.001

We could also use the pseudo-observations with GEEs:

# Fit GEE with pseudo-observations for hazard ratio
fit <- geese(1 - pseudo_observation ~ sex + as.factor(round(month)), 
              data = dat_pseudo, mean.link = "cloglog", id = id, 
              jack = TRUE, scale.fix = TRUE)

# Transform estimate to hazard ratio
exp(-exp(fit[["beta"]][["sex"]]))
[1] 0.5850911

Call:
geese(formula = 1 - pseudo_observation ~ sex + as.factor(round(month)), 
    id = id, data = dat_pseudo, mean.link = "cloglog", scale.fix = TRUE, 
    jack = TRUE)

Mean Model:
 Mean Link:                 cloglog 
 Variance to Mean Relation: gaussian 

 Coefficients:
              (Intercept)                       sex  as.factor(round(month))1 
               -3.6918044                -0.6236440                 0.7654872 
 as.factor(round(month))2  as.factor(round(month))3  as.factor(round(month))4 
                1.3916559                 1.8761836                 2.1968367 
 as.factor(round(month))5  as.factor(round(month))6  as.factor(round(month))7 
                2.5002499                 2.8332901                 3.0708883 
 as.factor(round(month))8  as.factor(round(month))9 as.factor(round(month))10 
                3.2163835                 3.3317696                 3.5004182 
as.factor(round(month))11 as.factor(round(month))12 as.factor(round(month))13 
                3.6331097                 3.7847453                 3.8815408 
as.factor(round(month))14 as.factor(round(month))15 as.factor(round(month))16 
                3.9438051                 4.0598295                 4.1343341 
as.factor(round(month))17 as.factor(round(month))18 as.factor(round(month))19 
                4.2033542                 4.3004767                 4.3789626 
as.factor(round(month))20 as.factor(round(month))21 as.factor(round(month))22 
                4.4171821                 4.4922323                 4.5796375 
as.factor(round(month))23 as.factor(round(month))24 as.factor(round(month))25 
                4.6752738                 4.8838799                 5.0177309 
as.factor(round(month))26 as.factor(round(month))27 as.factor(round(month))28 
                5.0989874                 5.1815052                 5.1619400 
as.factor(round(month))29 as.factor(round(month))32 as.factor(round(month))33 
                5.1947034                 5.3749510                 5.2464295 
as.factor(round(month))34 
                5.2464295 

Scale is fixed.

Correlation Model:
 Correlation Structure:     independence 

Returned Error Value:  0 
Number of clusters:   228   Maximum cluster size: 186 

The variance \(\hat{var}(\hat{\beta})\) can be estimated using a sandwich estimator or bootstrapping.

But do not forget: the real practical value lies in the cases where no other standard methods apply.

Visualizing with pseudo-observations

Pseudo-observations can also be used to visualize model assumptions. More in Andersen PK, Perme MP. Stat Methods Med Res. 2010.

We can also use it to create a calibration plot in prediction modelling:

Calibration-in-the-large        Calibration slope              C statistic 
                   1.107                    1.000         0.61 (0.55-0.66) 

# Create small Fine-Gray prediction model
model_vars <- dev(lung_cr, "Surv(time, status_cr) ~ age + sex + ph.ecog + ph.karno + meal.cal + wt.loss", "fine-gray")

# Validate model
pred(lung_cr, "fine-gray", status_cr, time, lpsamp = lpsamp(lung_cr)) %>%
    # Validate
    validate(observed, pred, lp, "fine-gray", time, histogram_label = 0.1, deciles = FALSE)

Further reading

  • Andersen P.K., Pohar Perme M. Pseudo observations in survival analysis. Stat Methods Med Res 2010.
  • Pavlic K., Pohar Perme M. Using pseudo-observations for estimation in relative survival. Biostatistics 2018.
  • Pohar Perme M., Andersen P.K. Checking hazard regression models using pseudo-observations. Stat Med 2008.
  • Parner E.T., Andersen P.K., Overgaard M. Regression models for censored time-to-event data using infinitesimal jack-knife pseudo-observations, with applications to left-truncation. Lifetime Data Anal 2023.
  • Ambrogi F., Iacobelli S., Andersen P.K. Analyzing differences between restricted mean survival time curves using pseudo-values. BMC Med Res Methodol 2022.
  • Andersen P.K., Syriopoulou E., Parner E.T. Causal inference in survival analysis using pseudo-observations Stat Med 2017.
  • Andersen P.K. Decomposition of number of life years lost according to cause of death Stat Med 2023.
  • Andersen P.K., Klein J.P., Rosthøj S. Generalized linear models for correlated pseudo-observations, with applications to multi-state models. Biometrika 2003.
  • Graw F., Gerds T.A., Schumacher M. On pseudo-values for regression analysis in multi-state models.
  • Sabathé C., Andersen P.K., Helmer, C. et al. Regression analysis in an illness-death model with interval-censored data: A pseudo-value approach. Stat Methods Med Res 2020.
  • Tanaka S., Brookhart M.A., Fine J.P. G-estimation of structural nested mean models for competing risks data using pseudo-observations Biostatistics 2020.