Per person, we observe a survival time \(T\) for each individual \(i\).
Every person survives for a certain amount of time.
If for each individual \(T\) is complete, we could apply a linear model to \(T\), or apply binary methods by treating it as \(I(T \le t)\) for any reasonable \(t\).
If we know the survival time of each person, we could use that as a continuous outcome or binary outcome after dichotomizing.
We could set up models for any function, \(f(T)\).
We could use time as outcome in any analysis.
Then, we could estimate the expected value \(E(f(T))\) as \(1 / n \sum_i{f(T_i)}\).
We could then calculate the expected value of time by taking the mean of each individual’s time.
What if we do not observe \(T\) for each individual \(i\)?
What if we do not know the survival time of each person?
In that case, we can substitute \(f(T)\) with pseudo-observations.
In that case, we can use pseudo-observations instead of time.
Calculating pseudo-observations
Mathematical
Written
Suppose we do not observe all \(T_i\), but we have a well-behaved estimator \(\hat{\theta}\) to estimate \(E(f(T))\).
What if we do not know the survival time of each individual, but we do have some kind of summary measure that could calculate the expected value of time.
This could be the Kaplan-Meier estimate \(S(t) = E(I(T > t))\).
The summary measure could simply be the Kaplan-Meier estimate for a certain point in time.
We can then estimate \(\hat{\theta}^{-i}\) for each individual \(i\), \(i = 1, ..., n\), where \(\hat{\theta}^{-i}\) is the estimator applied to a data set of size \(n\) - 1 obtained by excluding the \(i^{th}\) individual.
We can calculate a specific summary measure for each person, by calculating the summary measure on the whole population minus that person, for each person.
Using \(\hat{\theta}^{-i}\), we can estimate the pseudo-observation \(\hat{\theta_i}\) for each \(i\) as \(\hat{\theta_i} = n * \hat{\theta} - (n - 1) * \hat{\theta}^{-i}\).
Then per person we can calculate their pseudo-observation based on the difference between the total summary measure and the summary measure if they are excluded.
We can now replace the incomplete observed \(f(T_i)\) by \(\hat{\theta_i}\). \(\hat{\theta_i}\) can then be used as an outcome variable. The pseudo-osbervation \(\hat{\theta_i}\) will always be used for all subjects, even if \(f(T_i)\) was observed.
Instead of using time as an outcome, we can now use the pseudo-observations as outcome. If we use pseudo-observations, we use them for every individual, regardless whether we observed time for them.
Intuitively, the \(i^{th}\) pseudo-observation is the contribution of individual \(i\) to the estimate of \(E(f(T))\).
Intuitively, each person’s pseudo-observation is how much they contribute to the overall estimate on the population.
Characteristics of the pseudo-observation
Now let’s calculate some pseudo-observations using the NCCTG Lung Cancer Data for a few individuals.
# Normal KM curve stratified by sexlung %>%# Fit survival modelsurvfit(Surv(time, status) ~ sex, data = .) %>%# Change to data frametidy() %>%# Change time to monthsmutate(time = time / (365.25/12)) %>%# Plot dataggplot(aes(x = time, y = estimate, colour = strata, fill = strata)) +# Geometriesgeom_step() +# Scalingscale_x_continuous(limits =c(0, 36), expand =expansion(add =0.1), breaks =seq(0, 36, 3), labels =seq(0, 36, 3), name ="Time (months)") +scale_y_continuous(limits =c(0, 1), expand =c(0, 0), breaks =seq(0, 1, 0.1),labels =seq(0, 1, 0.1), name ="Proportion alive") +scale_colour_manual(values =c("darkred", "darkorange"), labels =c("Men", "Women")) +scale_fill_manual(values =c("darkred", "darkorange"), labels =c("Men", "Women")) +# Aestheticstheme_custom()# Overall estimatesoverall <-surv_est(lung, "sex") %>%# Rename variablerename(est_overall = estimate)# Estimates with individuals eliminatedindividual <-lapply(lung[["id"]], \(x) surv_est(filter(lung, id != x), "sex") %>%mutate(id = x)) %>%# Bind data togetherbind_rows()# Add data togetherdat_pseudo <-pseudo_join(individual, c("time", "strata")) %>%# Arrange for groupingarrange(strata, time) %>%# Group per time in each stratumgroup_by(strata, time) %>%# Get mean estimatesummarise(po =mean(po), .groups ="drop") # Plot dataggplot(dat_pseudo, aes(x = time, y = po, colour = strata, fill = strata)) +# Geometriesgeom_step() +# Scalingscale_x_continuous(limits =c(0, 36), expand =expansion(add =0.1), breaks =seq(0, 36, 3), labels =seq(0, 36, 3), name ="Time (months)") +scale_y_continuous(limits =c(0, 1), expand =c(0, 0), breaks =seq(0, 1, 0.1),labels =seq(0, 1, 0.1), name ="Proportion alive") +scale_colour_manual(values =c("darkred", "darkorange"), labels =c("Men", "Women")) +scale_fill_manual(values =c("darkred", "darkorange"), labels =c("Men", "Women")) +# Aestheticstheme_custom()
Practical applications
Pohar Perme M. Simplifying survival analysis with pseudo-observations. Presented at: Advanced Survival Analysis Course; September 4, Leiden.
Survival at time \(t\)
We can use pseudo-observations to determine survival at a certain moment in time \(t\) using simple linear regression.
# Get data at 12 monthsdat_12m <-filter(dat_pseudo, month <=12) %>%# Arrange for groupingarrange(id, month) %>%# Group per persongroup_by(id) %>%# Keep only last observationslice_tail() %>%# Ungroup againungroup()# Fit linear model at t = 12lm(pseudo_observation ~ sex, data = dat_12m) %>%# Get model summarysummary()
Call:
lm(formula = pseudo_observation ~ sex, data = dat_12m)
Residuals:
Min 1Q Median 3Q Max
-0.8164 -0.4372 -0.3347 0.6018 0.7908
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.33466 0.04565 7.331 4.03e-12 ***
sex 0.18894 0.07266 2.600 0.00992 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5363 on 226 degrees of freedom
Multiple R-squared: 0.02905, Adjusted R-squared: 0.02476
F-statistic: 6.762 on 1 and 226 DF, p-value: 0.009924
Survival at time \(t\)
We can use pseudo-observations to determine survival at a certain moment in time \(t\) using simple linear regression.
# Get data at 18 monthsdat_18m <-filter(dat_pseudo, month <=18) %>%# Arrange for groupingarrange(id, month) %>%# Group per persongroup_by(id) %>%# Keep only last observationslice_tail() %>%# Ungroup againungroup()# Fit linear model at t = 18lm(pseudo_observation ~ sex, data = dat_18m) %>%# Get model summarysummary()
Call:
lm(formula = pseudo_observation ~ sex, data = dat_18m)
Residuals:
Min 1Q Median 3Q Max
-0.7739 -0.3613 -0.1894 0.2093 1.1292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.18642 0.04398 4.239 3.28e-05 ***
sex 0.17488 0.07000 2.498 0.0132 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5167 on 226 degrees of freedom
Multiple R-squared: 0.02687, Adjusted R-squared: 0.02257
F-statistic: 6.241 on 1 and 226 DF, p-value: 0.01319
Generalized estimating equations
Using generalized estimating equations (GEEs), we can use pseudo-observations as an outcome with different link functions for different modelling approaches:
Linear model: \(y\)
Logistic model: \(log(\frac{y}{1 - y})\)
Cox model: \(log(-log(S(t)))\)
Restricted mean survival time: \(E(T \wedge \tau | Z)\) or \(log(E(T \wedge \tau | Z))\)
Fine-Gray model: \(log(-log(1 - C_1(t)))\)
Normally, we could run a Cox model:
# Fit Cox model for hazard ratiocoxph(Surv(time, status) ~ sex, data = lung) %>%# Get model summarysummary()
Call:
coxph(formula = Surv(time, status) ~ sex, data = lung)
n= 228, number of events= 165
coef exp(coef) se(coef) z Pr(>|z|)
sex -0.5310 0.5880 0.1672 -3.176 0.00149 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
sex 0.588 1.701 0.4237 0.816
Concordance= 0.579 (se = 0.021 )
Likelihood ratio test= 10.63 on 1 df, p=0.001
Wald test = 10.09 on 1 df, p=0.001
Score (logrank) test = 10.33 on 1 df, p=0.001
We could also use the pseudo-observations with GEEs:
# Fit GEE with pseudo-observations for hazard ratiofit <-geese(1- pseudo_observation ~ sex +as.factor(round(month)), data = dat_pseudo, mean.link ="cloglog", id = id, jack =TRUE, scale.fix =TRUE)# Transform estimate to hazard ratioexp(-exp(fit[["beta"]][["sex"]]))
[1] 0.5850911
Call:
geese(formula = 1 - pseudo_observation ~ sex + as.factor(round(month)),
id = id, data = dat_pseudo, mean.link = "cloglog", scale.fix = TRUE,
jack = TRUE)
Mean Model:
Mean Link: cloglog
Variance to Mean Relation: gaussian
Coefficients:
(Intercept) sex as.factor(round(month))1
-3.6918044 -0.6236440 0.7654872
as.factor(round(month))2 as.factor(round(month))3 as.factor(round(month))4
1.3916559 1.8761836 2.1968367
as.factor(round(month))5 as.factor(round(month))6 as.factor(round(month))7
2.5002499 2.8332901 3.0708883
as.factor(round(month))8 as.factor(round(month))9 as.factor(round(month))10
3.2163835 3.3317696 3.5004182
as.factor(round(month))11 as.factor(round(month))12 as.factor(round(month))13
3.6331097 3.7847453 3.8815408
as.factor(round(month))14 as.factor(round(month))15 as.factor(round(month))16
3.9438051 4.0598295 4.1343341
as.factor(round(month))17 as.factor(round(month))18 as.factor(round(month))19
4.2033542 4.3004767 4.3789626
as.factor(round(month))20 as.factor(round(month))21 as.factor(round(month))22
4.4171821 4.4922323 4.5796375
as.factor(round(month))23 as.factor(round(month))24 as.factor(round(month))25
4.6752738 4.8838799 5.0177309
as.factor(round(month))26 as.factor(round(month))27 as.factor(round(month))28
5.0989874 5.1815052 5.1619400
as.factor(round(month))29 as.factor(round(month))32 as.factor(round(month))33
5.1947034 5.3749510 5.2464295
as.factor(round(month))34
5.2464295
Scale is fixed.
Correlation Model:
Correlation Structure: independence
Returned Error Value: 0
Number of clusters: 228 Maximum cluster size: 186
The variance \(\hat{var}(\hat{\beta})\) can be estimated using a sandwich estimator or bootstrapping.
But do not forget: the real practical value lies in the cases where no other standard methods apply.
Visualizing with pseudo-observations
Pseudo-observations can also be used to visualize model assumptions. More in Andersen PK, Perme MP. Stat Methods Med Res. 2010.
We can also use it to create a calibration plot in prediction modelling:
Calibration-in-the-large Calibration slope C statistic
1.107 1.000 0.61 (0.55-0.66)
# Create small Fine-Gray prediction modelmodel_vars <-dev(lung_cr, "Surv(time, status_cr) ~ age + sex + ph.ecog + ph.karno + meal.cal + wt.loss", "fine-gray")# Validate modelpred(lung_cr, "fine-gray", status_cr, time, lpsamp =lpsamp(lung_cr)) %>%# Validatevalidate(observed, pred, lp, "fine-gray", time, histogram_label =0.1, deciles =FALSE)
Further reading
Andersen P.K., Pohar Perme M. Pseudo observations in survival analysis. Stat Methods Med Res 2010.
Pavlic K., Pohar Perme M. Using pseudo-observations for estimation in relative survival. Biostatistics 2018.
Pohar Perme M., Andersen P.K. Checking hazard regression models using pseudo-observations. Stat Med 2008.
Parner E.T., Andersen P.K., Overgaard M. Regression models for censored time-to-event data using infinitesimal jack-knife pseudo-observations, with applications to left-truncation. Lifetime Data Anal 2023.
Ambrogi F., Iacobelli S., Andersen P.K. Analyzing differences between restricted mean survival time curves using pseudo-values. BMC Med Res Methodol 2022.
Andersen P.K., Syriopoulou E., Parner E.T. Causal inference in survival analysis using pseudo-observations Stat Med 2017.
Andersen P.K. Decomposition of number of life years lost according to cause of death Stat Med 2023.
Andersen P.K., Klein J.P., Rosthøj S. Generalized linear models for correlated pseudo-observations, with applications to multi-state models. Biometrika 2003.
Graw F., Gerds T.A., Schumacher M. On pseudo-values for regression analysis in multi-state models.
Sabathé C., Andersen P.K., Helmer, C. et al.Regression analysis in an illness-death model with interval-censored data: A pseudo-value approach. Stat Methods Med Res 2020.
Tanaka S., Brookhart M.A., Fine J.P. G-estimation of structural nested mean models for competing risks data using pseudo-observations Biostatistics 2020.