PREDICT - Prediction Practical 1

Analysis and Reporting of Prediction Modelling Studies

Author

Roemer J. Janse & Carolien C.H.M. Maas

Published

November 6, 2025

1 Introduction

In this practical, we will see what the minimal requirements are to develop a prediction model in R (i.e. not from a statistical point of view) and develop a prediction model ourselves.

At the end of this practical, you will know what is required to develop and validate a prediction model in R. Additionally, you will understand the differences between different packages for prediction modelling.

2 Setting up R

Before we get to the fun part, we of course have to load a number of packages that will help us. Note that we use pacman::p_load() for this, instead of library(), because it loads each package, installs it first if it is missing, and can load multiple packages in one call.

# Set seed for random process
set.seed(1)

# Loading packages
pacman::p_load("dplyr",     # Data wrangling
               "ggplot2",   # Data visualization
               "ggExtra",   # Add-ons for ggplot2
               "mice",      # Multiple imputation of data
               "pROC",      # C-statistic for logistic models
               "splines",   # Natural cubic splines
               "survival",  # Developing Cox prediction models
               "rio",       # Importing data
               "rms"        # Developing prediction models and restricted cubic splines
)

Note that we use {pacman} instead of the usual approach with install.packages() and library(), but that this requires a one-time prior installation of {pacman}:

install.packages("pacman")
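
If you want a script to be fully self-contained, a small optional sketch is to guard this installation so it only runs when {pacman} is actually missing:

# Install {pacman} only if it is not yet available
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}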

3 Loading data

For this practical, we will use two datasets.

3.1 Traumatic brain injury (TBI) data

This dataset contains 2,159 patients from the international and US Tirilazad trials (distributed here for didactic purposes only). The primary outcome was the Glasgow Outcome Scale (range 1 through 5) at 6 months.

The TBI dataset was sent to you together with this practical and can be loaded from your local device:

# Specify path of TBI dataset
path <- "C:/users/avid_PREDICT_student/documents/pint/TBI.txt"

# Load TBI data
tbi <- import(path)
Warning in (function (input = "", file = NULL, text = NULL, cmd = NULL, :
Detected 24 column names but the data has 25 columns (i.e. invalid file). Added
an extra default column name for the first column which is guessed to be row
names or an index. Use setnames() afterwards if this guess is not correct, or
fix the file write command that created the file to create a valid file.

The warning we receive is the result of row numbers being present in the .txt file (as also guessed by import()). Given that import() guessed correctly, we can ignore this warning and treat the extra column V1 as individual identifiers.
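
Since we will treat V1 as an identifier, a quick sanity check (our own addition, not part of the original workflow) is to confirm that it is unique per patient:

# Sanity check: an identifier should contain no duplicated values
stopifnot(!anyDuplicated(tbi[["V1"]]))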

3.2 NCCTG lung cancer data

The NCCTG lung cancer dataset contains 228 patients from the North Central Cancer Treatment group, with information on time until death and performance scores.

The NCCTG lung cancer dataset is provided by the {survival} package and can be loaded from within R:

lung <- survival::lung

More formally, we could call the lung dataset with:

data(cancer)

However, this will show all related cancer datasets available in {survival} as promises (i.e. available to you but not loaded into memory). If you run lung once, it will be loaded into your global environment:

lung

Nonetheless, the other datasets remain visible as promises in your global environment, which can be rather messy.

Below are the codebooks for both the TBI and the lung data.

Codebooks for the TBI and lung datasets

TBI dataset:

 Variable  Description
 V1        Identifier
 trial     Trial identification
 d.gos     Glasgow Outcome Scale at 6 months:
             1. dead
             2. vegetative
             3. severe disability
             4. moderate disability
             5. good recovery
 d.mort    Mortality at 6 months
 d.unfav   Unfavourable outcome at 6 months
 cause     Cause of injury:
             1. road traffic accident
             2. motorbike
             3. assault
             4. domestic/fall
             5. other
 age       Age (yrs)
 d.motor   Admission motor score (range 1-6)
 d.pupil   Pupillary reactivity:
             1. both reactive
             2. one reactive
             3. none reactive
 hypoxia   Hypoxia before or at admission
 hypotens  Hypotension before or at admission
 ctclass   Marshall CT classification (range 1-6)
 tsah      tSAH at CT
 edh       EDH at CT
 cisterns  Compressed cisterns at CT:
             1. no
             2. slightly
             3. fully
 shift     Midline shift >5 mm at CT
 glucose   Glucose at admission (mmol/L)
 glucoset  Truncated glucose values (mmol/L)
 ph        pH
 sodium    Sodium (mmol/L)
 hb        Hemoglobin (g/dL)
 hbt       Truncated hemoglobin (g/dL)

Lung dataset:

 Variable   Description
 inst       Institution code
 time       Survival time (days)
 status     Censoring status:
              1. censored
              2. dead
 age        Age (yrs)
 ph.ecog    ECOG performance score as rated by the physician:
              0. asymptomatic
              1. symptomatic but completely ambulatory
              2. in bed <50% of the day
              3. in bed >50% of the day but not bedbound
              4. bedbound
 ph.karno   Karnofsky performance score as rated by the physician (range 0-100)
 pat.karno  Karnofsky performance score as rated by the patient
 meal.cal   Calories consumed at meals
 wt.loss    Weight loss in the last six months (pounds)

4 Predicting a binary unfavourable outcome

In this practical, we will develop a prediction model for the risk of an unfavourable outcome after a TBI (d.unfav). As a first step, let’s better understand our data.

4.1 Getting to know our data

Suppose we consulted some TBI experts, who told us that the most important predictors we want to include are motor activity (d.motor), pupillary reactivity (d.pupil), TBI cause (cause), and age (age).

First, let’s inspect the data available on the outcome and predictors chosen based on clinical expertise.

# Summarize key variables from the TBI dataset
# Outcome: d.unfav
# Predictors: d.motor, d.pupil, cause, age
summary(tbi[, c("d.unfav", "d.motor", "d.pupil", "cause", "age")])
    d.unfav          d.motor        d.pupil             cause          
 Min.   :0.0000   Min.   :1.000   Length:2159        Length:2159       
 1st Qu.:0.0000   1st Qu.:3.000   Class :character   Class :character  
 Median :0.0000   Median :4.000   Mode  :character   Mode  :character  
 Mean   :0.3942   Mean   :3.991                                        
 3rd Qu.:1.0000   3rd Qu.:5.000                                        
 Max.   :1.0000   Max.   :6.000                                        
      age       
 Min.   :14.00  
 1st Qu.:22.00  
 Median :30.00  
 Mean   :33.21  
 3rd Qu.:43.00  
 Max.   :79.00  

This shows us that the outcome variable d.unfav is binary, the predictor d.motor has six categories, and age is a continuous variable ranging from 14 to 79 years. Let’s further inspect d.pupil and cause.

table(tbi[["d.pupil"]])

     both reactive no reactive pupils       one reactive 
              1430                327                279 
table(tbi[["cause"]])

              Assault         domestic/fall             Motorbike 
                  134                   370                   420 
                other Road traffic accident 
                  387                   848 

The predictors d.pupil and cause are categorical variables with three and five levels, respectively. The data summary indicates that these variables are currently stored as character types. To use them effectively in our analysis, we will convert them to factors, which allows for proper handling of categorical information in modeling.

tbi[["d.pupil"]] <- factor(tbi[["d.pupil"]],
                           levels = c("no reactive pupils", "one reactive", "both reactive"),
                           labels = 0:2)
tbi[["cause"]] <- factor(tbi[["cause"]],
                           levels = c("Road traffic accident", "Motorbike", "domestic/fall", "Assault", "other"),
                           labels = 1:5)

For completeness, we will also convert d.motor to a factor, as this is necessary for proper analysis. It is important to always code categorical variables as factors in R. Otherwise, they may be treated as continuous variables in the model, resulting in a single coefficient that assumes equal effects between levels, an assumption that is typically inappropriate for categorical data.

tbi[["d.motor"]] <- as.factor(tbi[["d.motor"]])

Let’s further inspect the outcome.

# Counts of non-cases and cases
table(tbi[["d.unfav"]])

   0    1 
1308  851 
# Distribution of non-cases and cases (as %)
proportions(table(tbi[["d.unfav"]])) * 100

      0       1 
60.5836 39.4164 

Thus, we have 851 cases (39.4%). The likelihood that a prediction model is overfitted depends partly on the number of cases and non-cases. Fortunately, the number of cases is quite substantial in our dataset. Using the predictors suggested by the clinical experts, we estimate 13 coefficients in the model: the intercept, one for age, and one for each non-reference category of motor activity (5), pupillary reactivity (2), and cause (4). You can verify the number of coefficients in the next section. Given the relatively high number of events, we are not too concerned about overfitting.
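
As a rough back-of-the-envelope check, we can express this as events per candidate parameter (a common heuristic; the count of 12 candidate coefficients excluding the intercept follows from the reasoning above):

# Events per candidate parameter (12 non-intercept coefficients)
n.events <- min(table(tbi[["d.unfav"]]))  # smaller of cases/non-cases
n.events / 12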

4.2 Dealing with missingness

Let’s inspect the number of missings.

colSums(is.na(tbi[, c("d.unfav", "d.motor", "d.pupil", "cause", "age")]))
d.unfav d.motor d.pupil   cause     age 
      0       0     123       0       0 

This confirms that the outcome variable d.unfav has no missing values. All predictor variables are complete, except for d.pupil, which is missing in 123 cases. For the purposes of this practical exercise, we will apply single imputation for simplicity. Although single imputation is often unsatisfactory, for our current didactic purposes it is enough.

1single.imputation <- mice::mice(tbi[, c("d.unfav",
                                        "d.motor", 
                                        "d.pupil",
                                        "cause",
                                        "age")],
2                                m = 1,
3                                seed = 1)
1
We include only the necessary variables in the imputation model. Note that these variables may differ from those in your (initial) prediction model.
2
For now, we’ve set the number of imputations to one, but in practice, this should typically be greater than one.
3
Because imputation is a random process, we set a seed to ensure reproducible results. Without a seed, each run would produce different imputed values.

 iter imp variable
  1   1  d.pupil
  2   1  d.pupil
  3   1  d.pupil
  4   1  d.pupil
  5   1  d.pupil

R is telling us that it performed a single imputation with five iterations on the variable d.pupil. These iterations are used by the mice algorithm to refine and stabilize the imputed values through chained equations before finalizing the result. To be able to use the single imputed dataset in further analyses, we need to extract it from the mice object.

# extract the single imputation from the mice object
imputed.tbi <- mice::complete(single.imputation, 1)
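
To confirm the imputation did what we expect, we can verify that no missing values remain in the extracted dataset:

# Confirm that no missing values remain after imputation
colSums(is.na(imputed.tbi))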

4.3 Method I: glm()

4.3.1 Fit an initial prediction model

We are now ready to develop our prediction model. Since the outcome variable is binary, we will use logistic regression. Logistic regression belongs to the family of generalized linear models (GLMs), so we can use the glm() function. By specifying family = binomial, we indicate that the model should use a logistic link function appropriate for binary outcomes. For didactic purposes, let’s first use the dataset with missing values for d.pupil.

# Fit the prediction model
fit.glm.tbi <- stats::glm(
1  d.unfav ~
2    age +
    d.motor +
    d.pupil +
    cause,
  data = tbi,
3  family = binomial)
1
Fit a logistic regression model with the outcome d.unfav.
2
Add age as a linear predictor for now (we will inspect non-linearity next).
3
Specify the data source and the family to define a logistic model.

Let’s see what the output looks like:

# Print output
summary(fit.glm.tbi)

Call:
stats::glm(formula = d.unfav ~ age + d.motor + d.pupil + cause, 
    family = binomial, data = tbi)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.164267   0.624123  -0.263   0.7924    
age          0.037503   0.004037   9.289  < 2e-16 ***
d.motor2     0.648468   0.621264   1.044   0.2966    
d.motor3     0.057519   0.614843   0.094   0.9255    
d.motor4    -0.660139   0.609902  -1.082   0.2791    
d.motor5    -1.304401   0.610827  -2.135   0.0327 *  
d.motor6    -1.628588   0.680168  -2.394   0.0166 *  
d.pupil1    -0.733449   0.185156  -3.961 7.46e-05 ***
d.pupil2    -1.285761   0.148476  -8.660  < 2e-16 ***
cause2      -0.020937   0.145428  -0.144   0.8855    
cause3       0.143078   0.153424   0.933   0.3510    
cause4      -0.183664   0.231363  -0.794   0.4273    
cause5       0.251604   0.146468   1.718   0.0858 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2726.7  on 2035  degrees of freedom
Residual deviance: 2251.5  on 2023  degrees of freedom
  (123 observations deleted due to missingness)
AIC: 2277.5

Number of Fisher Scoring iterations: 4

The attentive reader might notice that in one of the last lines, the following message is printed:

(123 observations deleted due to missingness)

It is important to realize that, as in much statistical software, individuals with missing values are excluded by default. However, if the missingness is informative (which is often the case), this likely results in a biased prediction model (and we lose some power). Therefore, we have performed single imputation to deal with missingness.

If we refit the model with the imputed values for d.pupil, we get:

# Fit the prediction model
fit.glm.tbi <- stats::glm(d.unfav ~
                            age +
                            d.motor +
                            d.pupil +
                            cause,
                          data = imputed.tbi,
                          family = binomial)

# Print output
summary(fit.glm.tbi)

Call:
stats::glm(formula = d.unfav ~ age + d.motor + d.pupil + cause, 
    family = binomial, data = imputed.tbi)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.02818    0.61014  -0.046  0.96316    
age          0.03794    0.00389   9.752  < 2e-16 ***
d.motor2     0.45496    0.60584   0.751  0.45268    
d.motor3    -0.08903    0.59987  -0.148  0.88202    
d.motor4    -0.77002    0.59508  -1.294  0.19567    
d.motor5    -1.40657    0.59582  -2.361  0.01824 *  
d.motor6    -1.81714    0.66589  -2.729  0.00636 ** 
d.pupil1    -0.77994    0.17883  -4.362 1.29e-05 ***
d.pupil2    -1.29120    0.14347  -9.000  < 2e-16 ***
cause2      -0.03586    0.14089  -0.255  0.79906    
cause3       0.05621    0.14879   0.378  0.70558    
cause4      -0.12692    0.22010  -0.577  0.56418    
cause5       0.26330    0.14136   1.863  0.06251 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2895.5  on 2158  degrees of freedom
Residual deviance: 2410.5  on 2146  degrees of freedom
AIC: 2436.5

Number of Fisher Scoring iterations: 4

The message on observations removed due to missingness is now gone (because no individuals were removed).

4.3.2 Consider non-linearities for continuous variables

Age is a continuous variable, and therefore we should investigate whether adding it as a linear term is sufficient or whether we should model it non-linearly. We can do this by modelling the outcome as a function of age and visually inspecting the results.

# Use a logistic regression model to model the outcome as a function of age
fit.glm.tbi.nonlinear <- stats::glm(
1  d.unfav ~ rms::rcs(age, 4) +
    d.motor +
    d.pupil +
    cause,
  data = imputed.tbi,
  family = binomial
)        

# Plot age on log odds scale
2ggplot2::ggplot(
  data = imputed.tbi,
  ggplot2::aes(x = age, y = fit.glm.tbi.nonlinear[["linear.predictors"]])
) +
3  ggplot2::geom_point() +
4  ggplot2::geom_smooth(
    method = "loess",
    se = TRUE,
    color = "blue",
    fill = "lightblue",
    formula = "y ~ x"
  ) +
5  ggplot2::xlab("Age (yrs)") + ggplot2::ylab("Log odds") +
6  ggplot2::theme_bw()
1
Fit a logistic regression model for the outcome d.unfav, modelling age as a non-linear transformation using restricted cubic splines (rms::rcs()) and including all other predictors.
2
Create the basis for a plot, where age is on the x-axis and the linear predictor (linear.predictors) from the logistic regression model is on the y-axis.
3
Draw points in the plot based on the data.
4
Add a smoothed curve to the plot using a locally estimated scatterplot smoothing (LOESS) curve.
5
Change the axes labels.
6
Change the look of the plot using a pre-specified theme.

This plot is not the most visually appealing; an improved alternative is presented in the following section.

From the figure we observe that age is somewhat linearly associated with the outcome, even if we allow for non-linearities. Still, we want to formally test whether a non-linear model for age might be more appropriate.

# perform likelihood ratio test between the nested models
anova(fit.glm.tbi, fit.glm.tbi.nonlinear, test = "Chisq")
Analysis of Deviance Table

Model 1: d.unfav ~ age + d.motor + d.pupil + cause
Model 2: d.unfav ~ rms::rcs(age, 4) + d.motor + d.pupil + cause
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1      2146     2410.5                     
2      2144     2406.8  2   3.7058   0.1568

The likelihood ratio test (p-value = 0.1568) indicates no significant non-linearity. Thus, we do not need to apply any further non-linear transformation to age.

Please note that when fitting the non-linear model, we use three coefficients for age:

summary(fit.glm.tbi.nonlinear)

Call:
stats::glm(formula = d.unfav ~ rms::rcs(age, 4) + d.motor + d.pupil + 
    cause, family = binomial, data = imputed.tbi)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            0.990484   0.806998   1.227  0.21968    
rms::rcs(age, 4)age   -0.009789   0.024986  -0.392  0.69523    
rms::rcs(age, 4)age'   0.281545   0.154575   1.821  0.06854 .  
rms::rcs(age, 4)age'' -0.520271   0.295252  -1.762  0.07805 .  
d.motor2               0.430456   0.606625   0.710  0.47796    
d.motor3              -0.113539   0.600783  -0.189  0.85010    
d.motor4              -0.792502   0.595839  -1.330  0.18350    
d.motor5              -1.427317   0.596828  -2.392  0.01678 *  
d.motor6              -1.854870   0.666942  -2.781  0.00542 ** 
d.pupil1              -0.760307   0.179220  -4.242 2.21e-05 ***
d.pupil2              -1.289541   0.143464  -8.989  < 2e-16 ***
cause2                -0.055930   0.141515  -0.395  0.69268    
cause3                 0.040592   0.149166   0.272  0.78553    
cause4                -0.118230   0.221658  -0.533  0.59376    
cause5                 0.253768   0.141560   1.793  0.07303 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2895.5  on 2158  degrees of freedom
Residual deviance: 2406.8  on 2144  degrees of freedom
AIC: 2436.8

Number of Fisher Scoring iterations: 4

Thus, while the linear model (fit.glm.tbi) used 13 coefficients, the non-linear model employed 15. Even though we ultimately choose the linear model, it’s important to note that 15 degrees of freedom were used to explore potential non-linear effects. Therefore, when assessing the risk of overfitting, we should account for all 15 degrees of freedom.

4.3.3 Validating the prediction model

Now that we made a prediction model, we have to make sure it actually works. To do this, we compare the observed and the predicted values. To this end, we will have to somehow get the predicted values from the prediction model.

There are two easy ways to do that:

# Get predicted values from model fit object and add to imputed tbi data
1imputed.tbi[["preds"]] <- fit.glm.tbi[["fitted.values"]]

# Get predicted values from function
2preds_fun <- predict(fit.glm.tbi, type = "response")

# Compare values
3table(imputed.tbi[["preds"]] == preds_fun)
1
The predicted values are already stored in the model fit object. This makes it easy for us to extract and use those values.
2
We can also use the predict() function. predict() is a generic function that dispatches a method based on the type of model we fitted; in this case, R automatically detects that it should use predict.glm(). By specifying type = "response", we tell R that we want the predicted probabilities. The advantage of predict() is that we can also supply the newdata argument. By supplying newdata with a data frame whose column names correspond to the predictors, predict() calculates predictions for this new data.
3
We can see that the predictions are the same for each individual.

TRUE 
2159 
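
To illustrate the newdata argument mentioned in annotation 2, here is a prediction for a single hypothetical patient; the predictor values below are made up purely for illustration:

# Predicted probability for one hypothetical new patient
new.patient <- data.frame(
  age     = 50,
  d.motor = factor("4", levels = levels(imputed.tbi[["d.motor"]])),
  d.pupil = factor("2", levels = levels(imputed.tbi[["d.pupil"]])),
  cause   = factor("1", levels = levels(imputed.tbi[["cause"]]))
)
predict(fit.glm.tbi, newdata = new.patient, type = "response")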

With these predictions, we can plot an ROC curve and calculate the AUC, which is equivalent to the C-statistic for binary outcomes.

# create ROC object
ROC_obj <- pROC::roc(response = imputed.tbi[["d.unfav"]],
                     predictor = imputed.tbi[["preds"]])
Setting levels: control = 0, case = 1
Setting direction: controls < cases

# Calculate AUC
AUC <- pROC::auc(ROC_obj)
cat("AUC:", round(AUC, 3))
AUC: 0.765

# plot the ROC curve 
3pROC::ggroc(ROC_obj, legacy.axes = TRUE) +
4  ggplot2::geom_abline(linetype = "dashed", color = "gray") +
5  ggplot2::annotate("text", x = 0.6, y = 0.2,
           label = paste("AUC =", round(AUC, 3)),              
           size = 5) +
6  ggplot2::labs(title = "ROC Curve",
       x = "False Positive Rate (1 - Specificity)",
       y = "True Positive Rate (Sensitivity)") +
7  ggplot2::theme_bw()
3
Flips the x-axis from 1→0 to 0→1 using legacy.axes = TRUE.
4
Add the 45-degree line.
5
Add the AUC in the figure.
6
Change the axes labels.
7
Change the look of the plot using a pre-specified theme.

Thus, the discrimination of our model is 0.765.

To calculate our calibration intercept and calibration slope, we will need to extract the linear predictor from the fit object.

# Get linear predictors from model fit for calibration intercept and slope
1imputed.tbi[["lps"]] <- fit.glm.tbi[["linear.predictors"]]

# Calculate calibration intercept
2fit.offset.lp <- glm(d.unfav ~ offset(lps),
                     data = imputed.tbi, family = binomial)
3cal.int <- fit.offset.lp[["coefficients"]][["(Intercept)"]]

# Calculate calibration slope
4fit.lp <- glm(d.unfav ~ lps,
               data = imputed.tbi, family = binomial) 
5cal.slope <- fit.lp[["coefficients"]][["lps"]]

# Print intercept and slope
6cat("Intercept:", format(round(cal.int, 3), nsmall = 3),
  "\nSlope:", format(round(cal.slope, 3), nsmall = 3))
1
As with getting the predicted values, we can use the model fit, or use predict(fit, type = "link"), which has the advantage of allowing new data.
2
To calculate the calibration intercept, we fit a model using the outcome while treating the original linear predictor as an offset. This forces its coefficient to remain fixed at one.
3
The intercept from the model in step 2 is the calibration intercept.
4
To calculate the calibration slope, we fit a model predicting the outcome using the linear predictor.
5
The coefficient of the linear predictor of the model in step 4 is the calibration slope.
6
Print the intercept and the slope (i.e. the regression coefficient for the linear predictors). round() rounds the values and format() prints the decimals up to the value supplied in nsmall, even if they are only 0’s.
Intercept: 0.000 
Slope: 1.000

As expected, the calibration intercept is 0 and the calibration slope is 1, since the model is being validated on the same data it was developed with. If these values differ, it suggests an issue with the model fitting process. In unseen data, the calibration intercept and slope are expected to differ from zero and one, respectively.
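
To see how these two regressions behave on data that differ from the development set, here is a small sketch in which we treat a bootstrap resample as if it were external data; this is purely to illustrate the mechanics, and in practice you would use a truly external cohort:

# Sketch: calibration intercept and slope on a 'pseudo-external' sample
ext.data <- imputed.tbi[sample(nrow(imputed.tbi), replace = TRUE), ]
ext.lps  <- predict(fit.glm.tbi, newdata = ext.data, type = "link")
glm(d.unfav ~ offset(ext.lps), data = ext.data, family = binomial)  # calibration intercept
glm(d.unfav ~ ext.lps, data = ext.data, family = binomial)          # calibration slope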

Finally, we can create a calibration plot:

# Create calibration plot
ggplot2::ggplot(imputed.tbi, 
1                ggplot2::aes(x = preds, y = d.unfav)) +
2  ggplot2::geom_abline(
    linewidth = 1,
    colour = "black",
    alpha = 0.33
  ) +
3  ggplot2::geom_point(alpha = 0.33) +
4  ggplot2::geom_smooth(
    colour = "#785EF0",
    fill = "#785EF0",
    method = "loess",
    formula = "y ~ x"
  ) +
5  ggplot2::xlab("Predicted probability") +
  ggplot2::ylab("Observed probability") +
6  ggplot2::theme_bw()
1
We create a base layer with the data for the calibration plot.
2
We add the unity line that shows perfect calibration, which is 67% transparent (alpha = 0.33).
3
We draw points on the plot that represent the predicted vs. observed values for each individual.
4
We draw a locally estimated scatterplot smoothing (LOESS) curve which shows the calibration over the range of predicted values.
5
We define the labels for the axes.
6
We change the look of the plot based on a prespecified theme.

We can also add the histogram of predicted probabilities using ggMarginal() from {ggExtra}.

# Create calibration plot
1p <- ggplot2::ggplot(imputed.tbi, ggplot2::aes(
  x = preds,
  y = d.unfav,
  colour = as.factor(d.unfav)
)) + 
  ggplot2::geom_abline(linewidth = 1,
              colour = "black",
              alpha = 0.33) +
  ggplot2::geom_point(alpha = 0.33) +
  ggplot2::geom_smooth(
    colour = "#785EF0",
    fill = "#785EF0",
    method = "loess",
    formula = "y ~ x"
  ) +
2  ggplot2::scale_colour_manual(
    values = c("#648FFF", "#FFB000"),
    labels = c("Favourable outcome", "Unfavourable outcome")
  ) +
  ggplot2::xlab("Predicted probability") +
  ggplot2::ylab("Observed probability") +
  ggplot2::theme_bw() +
3  ggplot2::theme(
    legend.position = "bottom",
    legend.title = ggplot2::element_blank()
  )

# Add histograms
4ggExtra::ggMarginal(
  p,
  type = "histogram",
  margins = "x",
  binwidth = 0.01,
  groupFill = TRUE
) 
1
To have two separately coloured histograms for individuals with and without the outcome, we need to specify on what variable to stratify colour. As a side effect, this also colours the points drawn with geom_point().
2
With a manual scale, we define the colours and labels corresponding to each colour.
3
We slightly adjust the theme by moving the legend to the bottom of the plot and removing the title of the legend.
4
We add the histograms to the plot.

4.4 Method II: lrm()

So far, we have used the glm() function from one of the base R packages ({stats}). However, there is another package that is often used to develop prediction models: {rms}.

4.4.1 Consider non-linearities for continuous variables

To develop our prediction model using {rms}, we use the function lrm() for a logistic regression model. Note that we directly use the imputed dataset. First, let’s check whether age should be modeled non-linearly. The {rms} package requires you to tell R what the data distributions are using datadist().

# define data distributions
dd <- rms::datadist(imputed.tbi)
options(datadist="dd")

# fit a model allowing age to be non-linearly related to the outcome
fit.rms.tbi.nonlinear <- rms::lrm(
  d.unfav ~ rms::rcs(age, 4) +
    d.motor +
    d.pupil +
    cause,
  data = imputed.tbi,
1  x = TRUE, y = TRUE)
1
We have to specify x = TRUE and y = TRUE to be able to use the validation functions from {rms}.

We can obtain the odds ratios from an rms object very easily.

1summary(fit.rms.tbi.nonlinear)
2summary(fit.rms.tbi.nonlinear, age = c(20, 30))
3summary(fit.rms.tbi.nonlinear, age = c(30, 40))
1
By default, odds ratios are calculated for a change from the 25th to the 75th percentile. This is why we used datadist() to inform R about the data distribution.
2
You can specify a custom range for the odds ratio. For example, changing age from 20 to 30 gives an odds ratio of 1.2356.
3
Because age was modeled non-linearly, the odds ratio for a change from 30 to 40 is different, namely 1.7141.
             Effects              Response : d.unfav 

 Factor        Low High Diff. Effect    S.E.    Lower 0.95 Upper 0.95
 age           22  43   21     0.892280 0.13817  0.621470   1.16310  
  Odds Ratio   22  43   21     2.440700      NA  1.861700   3.19980  
 d.motor - 1:5  5   1   NA     1.427300 0.59683  0.257560   2.59710  
  Odds Ratio    5   1   NA     4.167500      NA  1.293800  13.42400  
 d.motor - 2:5  5   2   NA     1.857800 0.17196  1.520700   2.19480  
  Odds Ratio    5   2   NA     6.409500      NA  4.575600   8.97830  
 d.motor - 3:5  5   3   NA     1.313800 0.14598  1.027700   1.59990  
  Odds Ratio    5   3   NA     3.720200      NA  2.794500   4.95250  
 d.motor - 4:5  5   4   NA     0.634820 0.12540  0.389040   0.88059  
  Odds Ratio    5   4   NA     1.886700      NA  1.475600   2.41230  
 d.motor - 6:5  5   6   NA    -0.427550 0.32365 -1.061900   0.20679  
  Odds Ratio    5   6   NA     0.652100      NA  0.345800   1.22970  
 d.pupil - 0:2  3   1   NA     1.289500 0.14346  1.008400   1.57070  
  Odds Ratio    3   1   NA     3.631100      NA  2.741100   4.81010  
 d.pupil - 1:2  3   2   NA     0.529230 0.14372  0.247550   0.81092  
  Odds Ratio    3   2   NA     1.697600      NA  1.280900   2.25000  
 cause - 2:1    1   2   NA    -0.055930 0.14152 -0.333290   0.22143  
  Odds Ratio    1   2   NA     0.945610      NA  0.716560   1.24790  
 cause - 3:1    1   3   NA     0.040592 0.14917 -0.251770   0.33295  
  Odds Ratio    1   3   NA     1.041400      NA  0.777420   1.39510  
 cause - 4:1    1   4   NA    -0.118230 0.22166 -0.552670   0.31621  
  Odds Ratio    1   4   NA     0.888490      NA  0.575410   1.37190  
 cause - 5:1    1   5   NA     0.253770 0.14156 -0.023683   0.53122  
  Odds Ratio    1   5   NA     1.288900      NA  0.976600   1.70100  

             Effects              Response : d.unfav 

 Factor        Low High Diff. Effect    S.E.    Lower 0.95 Upper 0.95
 age           20  30   10     0.211590 0.10493  0.0059241  0.41726  
  Odds Ratio   20  30   10     1.235600      NA  1.0059000  1.51780  
 d.motor - 1:5  5   1   NA     1.427300 0.59683  0.2575600  2.59710  
  Odds Ratio    5   1   NA     4.167500      NA  1.2938000 13.42400  
 d.motor - 2:5  5   2   NA     1.857800 0.17196  1.5207000  2.19480  
  Odds Ratio    5   2   NA     6.409500      NA  4.5756000  8.97830  
 d.motor - 3:5  5   3   NA     1.313800 0.14598  1.0277000  1.59990  
  Odds Ratio    5   3   NA     3.720200      NA  2.7945000  4.95250  
 d.motor - 4:5  5   4   NA     0.634820 0.12540  0.3890400  0.88059  
  Odds Ratio    5   4   NA     1.886700      NA  1.4756000  2.41230  
 d.motor - 6:5  5   6   NA    -0.427550 0.32365 -1.0619000  0.20679  
  Odds Ratio    5   6   NA     0.652100      NA  0.3458000  1.22970  
 d.pupil - 0:2  3   1   NA     1.289500 0.14346  1.0084000  1.57070  
  Odds Ratio    3   1   NA     3.631100      NA  2.7411000  4.81010  
 d.pupil - 1:2  3   2   NA     0.529230 0.14372  0.2475500  0.81092  
  Odds Ratio    3   2   NA     1.697600      NA  1.2809000  2.25000  
 cause - 2:1    1   2   NA    -0.055930 0.14152 -0.3332900  0.22143  
  Odds Ratio    1   2   NA     0.945610      NA  0.7165600  1.24790  
 cause - 3:1    1   3   NA     0.040592 0.14917 -0.2517700  0.33295  
  Odds Ratio    1   3   NA     1.041400      NA  0.7774200  1.39510  
 cause - 4:1    1   4   NA    -0.118230 0.22166 -0.5526700  0.31621  
  Odds Ratio    1   4   NA     0.888490      NA  0.5754100  1.37190  
 cause - 5:1    1   5   NA     0.253770 0.14156 -0.0236830  0.53122  
  Odds Ratio    1   5   NA     1.288900      NA  0.9766000  1.70100  

             Effects              Response : d.unfav 

 Factor        Low High Diff. Effect    S.E.    Lower 0.95 Upper 0.95
 age           30  40   10     0.538920 0.10936  0.324570   0.75326  
  Odds Ratio   30  40   10     1.714100      NA  1.383400   2.12390  
 d.motor - 1:5  5   1   NA     1.427300 0.59683  0.257560   2.59710  
  Odds Ratio    5   1   NA     4.167500      NA  1.293800  13.42400  
 d.motor - 2:5  5   2   NA     1.857800 0.17196  1.520700   2.19480  
  Odds Ratio    5   2   NA     6.409500      NA  4.575600   8.97830  
 d.motor - 3:5  5   3   NA     1.313800 0.14598  1.027700   1.59990  
  Odds Ratio    5   3   NA     3.720200      NA  2.794500   4.95250  
 d.motor - 4:5  5   4   NA     0.634820 0.12540  0.389040   0.88059  
  Odds Ratio    5   4   NA     1.886700      NA  1.475600   2.41230  
 d.motor - 6:5  5   6   NA    -0.427550 0.32365 -1.061900   0.20679  
  Odds Ratio    5   6   NA     0.652100      NA  0.345800   1.22970  
 d.pupil - 0:2  3   1   NA     1.289500 0.14346  1.008400   1.57070  
  Odds Ratio    3   1   NA     3.631100      NA  2.741100   4.81010  
 d.pupil - 1:2  3   2   NA     0.529230 0.14372  0.247550   0.81092  
  Odds Ratio    3   2   NA     1.697600      NA  1.280900   2.25000  
 cause - 2:1    1   2   NA    -0.055930 0.14152 -0.333290   0.22143  
  Odds Ratio    1   2   NA     0.945610      NA  0.716560   1.24790  
 cause - 3:1    1   3   NA     0.040592 0.14917 -0.251770   0.33295  
  Odds Ratio    1   3   NA     1.041400      NA  0.777420   1.39510  
 cause - 4:1    1   4   NA    -0.118230 0.22166 -0.552670   0.31621  
  Odds Ratio    1   4   NA     0.888490      NA  0.575410   1.37190  
 cause - 5:1    1   5   NA     0.253770 0.14156 -0.023683   0.53122  
  Odds Ratio    1   5   NA     1.288900      NA  0.976600   1.70100  

Furthermore, we can easily plot the effects of the predictors using plot(Predict()).

# plot non-linearities
plot(Predict(fit.rms.tbi.nonlinear))

From the figure we see that age is somewhat linearly associated with the outcome, even if we allow for non-linearities. If we want to formally test for this, we can use the anova() function.

# test for non-linearities
anova(fit.rms.tbi.nonlinear)
                Wald Statistics          Response: d.unfav 

 Factor     Chi-Square d.f. P     
 age         99.16      3   <.0001
  Nonlinear   3.72      2   0.1553
 d.motor    159.78      5   <.0001
 d.pupil     83.82      2   <.0001
 cause        4.99      4   0.2884
 TOTAL      369.68     14   <.0001

The anova() function gives the result of a Wald test, which is similar to but slightly different from the likelihood ratio test we performed before. The Wald test for the non-linear terms of age (p-value = 0.1553) indicates no significant non-linearity. Thus, we do not need to apply any further non-linear transformation to age.

# fit a model allowing age to be linearly related to the outcome
fit.rms.tbi.linear <- rms::lrm(
  d.unfav ~ age +
    d.motor +
    d.pupil +
    cause,
  data = imputed.tbi,
  x = TRUE,
  y = TRUE
)  

4.4.2 Validating the prediction model

The {rms} package offers an easy function to validate prediction models: validate().

# Validate the model
val.model <- rms::validate(
  fit.rms.tbi.linear,
1  method = "boot",
2  B = 40,
3  bw = FALSE)
1
The validate() function was configured to use bootstrapping to estimate optimism. A bootstrap sample is created by resampling the original dataset with replacement until it reaches the same size as the original. For each bootstrap sample, the model is refitted and performance metrics (e.g., AUC) are calculated and compared to the original data. This process is repeated and the results are aggregated across all bootstraps. This provides insight in how the model might perform on slightly different data.
2
We set the number of bootstraps to 40, meaning 40 bootstrap samples are generated, 40 models are fitted, and their performance metrics are combined across these 40 iterations.
3
Currently, we’ve used the default settings for validate(). However, if you move on to more advanced modeling and apply backward selection, ensure that bw = TRUE and that you specify the stopping rule (e.g., rule = "aic") and the selection criterion (e.g., only keep predictors with a p-value below 0.05: sls = 0.05). This is important because the selection process needs to be repeated within each bootstrap sample to properly account for optimism; a sketch of such a call follows below.
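
As a sketch of such a call (with the p-value-based stopping rule and the selection criterion mentioned above as illustrative settings):

# Sketch: backward selection repeated within each bootstrap sample
rms::validate(fit.rms.tbi.linear,
              method = "boot", B = 40,
              bw = TRUE, rule = "p", sls = 0.05)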

How do we interpret the output of validate()? The function applies resampling, which is a method to account for overfitting, with the results printed in the index.corrected column. However, if we are just interested in the apparent performance, we can look at index.orig. To calculate the C-statistic, we can use \(c = \frac{Dxy}{2} + 0.5\). In our case, the C-statistic would be:

cat(" Apparent C-statistic", 
    round(val.model["Dxy", "index.orig"]/2 + 0.5, 3), "\n", 
    "Optimism-corrected C-statistic", 
    round(val.model["Dxy", "index.corrected"]/2 + 0.5, 3))
 Apparent C-statistic 0.765 
 Optimism-corrected C-statistic 0.759

The intercept and slope are given under ‘Intercept’ and ‘Slope’. To get the calibration plot, we can use the calibrate() function (which aims to give overfitting-corrected estimates):

# Optimism-corrected calibration plot
plot(rms::calibrate(fit.rms.tbi.linear))


n=2159   Mean absolute error=0.005   Mean squared error=3e-05
0.9 Quantile of absolute error=0.009

Another useful function, val.prob(), gives the calibration plot together with all the performance measures:

# Calibration plot
1rms::val.prob(p = plogis(predict(fit.rms.tbi.linear, type = "lp")),
              y = imputed.tbi[["d.unfav"]])                 
1
The linear predictors (LP) are first obtained from the fitted rms model (fit.rms.tbi.linear). Subsequently, the plogis() function applies the inverse logit transformation (\[P(Y = 1 \mid X) = \text{logistic}(LP) = \frac{1}{1 + e^{-LP}}\]) to the linear predictor to obtain probabilities.

          Dxy       C (ROC)            R2             D      D:Chi-sq 
 5.298920e-01  7.649460e-01  2.724802e-01  2.241995e-01  4.850466e+02 
          D:p             U      U:Chi-sq           U:p             Q 
 0.000000e+00 -9.263548e-04 -3.183231e-12  1.000000e+00  2.251258e-01 
        Brier     Intercept         Slope          Emax           E90 
 1.880106e-01 -6.394705e-16  1.000000e+00  9.129085e-03  6.044859e-03 
         Eavg           S:z           S:p 
 3.120131e-03 -4.152230e-02  9.668795e-01 

4.5 Method I or Method II?

So why would we pick one method over the other? Well, both methods have their advantages and disadvantages. It may be clear that {rms} has more native (i.e. inherent in the package) support for prediction modelling in terms of validation, with functions such as validate(), calibrate(), and val.prob(). Additionally, it makes it easy to get optimism-corrected performance measures. However, {rms} forces you to keep using functions from its own package, which are not always intuitive to use. Moreover, only a few packages have been created as add-ons to {rms}, which makes it difficult to integrate into other analyses, or analyses that were not implemented/supported by the author. Conversely, glm() is part of the {stats} package, one of the base packages of R. A huge number of packages have been developed that build on results from glm(). Additionally, as a base function of R, there is more focus on intuitive use and error-proofing.

4.6 The curse of the model fit object

Whether we develop a prediction model using glm() or lrm(), functions such as predict(), validate(), and calibrate() require us to supply the model fit object: the R object that contains the fitted model. This makes it much easier for the developer of the prediction model (who has the model fit object) to determine performance. However, validation becomes much more difficult. Either the validator needs to be supplied with the model fit object (meaning a certain level of skill in the corresponding programming language is required), or the validator needs to use functions that do not require the model fit.

Because the model fit object is often not supplied, the validator has to resort to functions that do not require the model fit. For the C-statistic, this could be roc() from the {pROC} package. However, to get the linear predictors, the modeller could use predict(), but the validator will have to calculate these based on the coefficients. This can mean a lot of (unnecessary) manual work, including the risk of error when writing down the predictor coefficients.
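
As a small sketch of what this manual work looks like: the linear predictor is the matrix product of the design matrix and the coefficient vector. Here we take the coefficients from fit.glm.tbi only to verify the result; a validator without the fit object would type them in by hand from the publication:

# Reconstruct the linear predictor from the coefficients 'by hand'
b <- coef(fit.glm.tbi)  # in practice: typed in from the publication
X <- model.matrix(~ age + d.motor + d.pupil + cause, data = imputed.tbi)
lp.manual <- as.vector(X %*% b)
all.equal(lp.manual, unname(fit.glm.tbi[["linear.predictors"]]))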

5 Predicting survival in advanced lung cancer

We will now also develop a survival prediction model based on Cox regression using the lung dataset. As one method, we will use the cph() function from the {rms} package; as the other, the coxph() function from the {survival} package, which is part of the core R packages and closely resembles glm() in how models are specified.

5.1 Getting to know our data

We (again hypothetically) consulted some lung cancer experts, who told us that the most important predictors we want to include are age (age), biological sex (sex), and weight loss in the last six months (wt.loss) and that we should predict death within two years.

First, let’s inspect the number of missing values.

# remove previous variables from other data set
rm(list=ls())

# Set seed for random process
set.seed(1)

# load data from survival package
lung <- survival::lung[, c("time", "status", "age", "sex", "wt.loss")]
summary(lung)
      time            status           age             sex       
 Min.   :   5.0   Min.   :1.000   Min.   :39.00   Min.   :1.000  
 1st Qu.: 166.8   1st Qu.:1.000   1st Qu.:56.00   1st Qu.:1.000  
 Median : 255.5   Median :2.000   Median :63.00   Median :1.000  
 Mean   : 305.2   Mean   :1.724   Mean   :62.45   Mean   :1.395  
 3rd Qu.: 396.5   3rd Qu.:2.000   3rd Qu.:69.00   3rd Qu.:2.000  
 Max.   :1022.0   Max.   :2.000   Max.   :82.00   Max.   :2.000  
                                                                 
    wt.loss       
 Min.   :-24.000  
 1st Qu.:  0.000  
 Median :  7.000  
 Mean   :  9.832  
 3rd Qu.: 15.750  
 Max.   : 68.000  
 NA's   :14       

Some survival times go beyond two years, so let’s cap those survival times.

# Cap survival times to two years
lung <- lung |>
    # Change variables
    dplyr::mutate(# Set status to censored if we observed individuals longer than two years 
1           status = ifelse(time > 365.25 * 2, 1, status),
           # Cap time to two years 
2           time = ifelse(time > 365.25 * 2, 365.25 * 2, time))
1
If individuals died after two years, we did not observe an event within two years, therefore they become censored. Two years is defined by 365.25 * 2.
2
We change the time until event to a maximum of two years. If the time is already shorter than two years, we keep that time.

Note that we use 365.25 days in a year. Although formally this should be 365.24, the second decimal is negligible for most follow-up durations in medical studies.

The risk that a prediction model is overfitted depends in part on the number of cases and non-cases:

# First, recode status to 0 and 1
lung[["status"]] <- lung[["status"]] - 1

# Counts of non-cases and cases
table(lung[["status"]])

  0   1 
 69 159 
# Distribution of non-cases and cases (as %)
proportions(table(lung[["status"]])) * 100

       0        1 
30.26316 69.73684 

In the lung dataset, we can see that we have more cases than non-cases. In this case, we look at the number of non-cases to get an idea of the risk of overfitting. Given the already small sample size and the small number of non-cases (n = 69, 30.3%), we likely have a high risk of overfitting. Given the scope of this practical, we will leave that be for now, but recommend that in real research settings, you evaluate this risk and adjust your statistical analysis accordingly.

5.2 Dealing with missingness

Because weight loss had some missings, we used single imputation. Again, given the didactic purpose of this practical, this is fine, but in a real research setting, multiple imputation should be used.

# Single imputation of data
1imputed.lung <- mice::complete(mice::mice(lung, m = 1, seed = 1))
1
As explained above, we immediately extract the single imputation from the mice object, which was configured for one imputation and a fixed seed to ensure reproducibility.

 iter imp variable
  1   1  wt.loss
  2   1  wt.loss
  3   1  wt.loss
  4   1  wt.loss
  5   1  wt.loss

5.3 Consider non-linearities for continuous variables

Because age and weight loss are continuous, we should investigate whether adding them as linear terms is sufficient or whether we should model them non-linearly. This time, we will first use the {rms} package; the alternative with the {survival} package follows afterwards.

5.4 Method I: cph()

# define data distributions
dd <- rms::datadist(imputed.lung)
options(datadist="dd")

# fit a model allowing age to be non-linearly related to the outcome
fit.rms.lung.nonlinear <- rms::cph(
1  survival::Surv(time, status) ~
2    rms::rcs(age, 4) +
    sex +
    rms::rcs(wt.loss, 4),
3  data = imputed.lung)

# show model
4summary(fit.rms.lung.nonlinear)
1
Fit a Cox regression model, with the outcome being a survival object containing the event and time to event.
2
Add predictors non-linear age, linear sex, and non-linear weight loss.
3
We use a single imputed dataset.
4
We present the hazard ratios using default settings.
             Effects              Response : survival::Surv(time, status) 

 Factor        Low High  Diff. Effect    S.E.    Lower 0.95 Upper 0.95
 age           56  69.00 13.00  0.040185 0.21948 -0.38999    0.47036  
  Hazard Ratio 56  69.00 13.00  1.041000      NA  0.67706    1.60060  
 sex            1   2.00  1.00 -0.525160 0.17187 -0.86202   -0.18829  
  Hazard Ratio  1   2.00  1.00  0.591460      NA  0.42231    0.82838  
 wt.loss        0  15.25 15.25  0.249890 0.19062 -0.12372    0.62350  
  Hazard Ratio  0  15.25 15.25  1.283900      NA  0.88363    1.86540  

Instead of the predicted probability, we plot the log relative hazard as a function of the predictor, using the plot(Predict()) function from {rms}.

# plot non-linearities
plot(Predict(fit.rms.lung.nonlinear))

Visually, it seems that age and wt.loss are non-linearly related to the outcome. Let’s formally test this.

# test for non-linearities
anova(fit.rms.lung.nonlinear)
                Wald Statistics          Response: survival::Surv(time, status) 

 Factor          Chi-Square d.f. P     
 age              6.11      3    0.1062
  Nonlinear       2.02      2    0.3636
 sex              9.34      1    0.0022
 wt.loss          4.42      3    0.2200
  Nonlinear       4.34      2    0.1141
 TOTAL NONLINEAR  6.40      4    0.1714
 TOTAL           21.35      7    0.0033

However, the Wald test shows that the non-linear terms are not significant (total non-linear p-value = 0.1714), so age and weight loss can be modeled linearly.

5.4.1 Validating the prediction model

# fit linear model
fit.rms.lung.linear <- rms::cph(
  survival::Surv(time, status) ~
    age +
    sex +
    wt.loss,
  x = TRUE,
  y = TRUE,
  data = imputed.lung
)          

# internal bootstrap validation
rms::validate(fit.rms.lung.linear)
      index.orig training   test optimism index.corrected  n
Dxy       0.2091   0.2228 0.1918   0.0310          0.1781 40
R2        0.0667   0.0830 0.0588   0.0242          0.0425 40
Slope     1.0000   1.0000 0.8847   0.1153          0.8847 40
D         0.0100   0.0127 0.0087   0.0040          0.0059 40
U        -0.0014  -0.0013 0.0010  -0.0024          0.0010 40
Q         0.0113   0.0141 0.0077   0.0064          0.0049 40
g         0.3795   0.4194 0.3516   0.0679          0.3117 40

5.5 Method II: coxph()

To fit the prediction model, we can use the coxph() function from the {survival} package. This works just like glm() in how we specify the prediction model, except that we specify the outcome using the Surv() function.

# Fit the non-linear prediction model
fit.rms.lung.nonlinear <- survival::coxph(
1  survival::Surv(time, status) ~
2    age + sex +
3    splines::ns(wt.loss, knots = c(1, 12)),
  data = imputed.lung
) 
1
Fit a Cox regression model, with the outcome being a survival object containing the event and time to event.
2
Add predictors age (linearly) and sex.
3
Add weight loss as a natural cubic spline using the ns() function from {splines}. We put knots at 1 and 12 based on the previous graph. The advantage of rcs() is that it does not transform the underlying variable (i.e. not a B-spline matrix basis for a natural cubic spline) and is therefore easier to interpret and write down in a concise formula. However, we are forced to use at least 3 knots meaning more coefficients are estimated, whilst ns() allows us to use a single knot.

Let’s see what this non-linear model looks like:

# Print output
summary(fit.rms.lung.nonlinear)
Call:
survival::coxph(formula = survival::Surv(time, status) ~ age + 
    sex + splines::ns(wt.loss, knots = c(1, 12)), data = imputed.lung)

  n= 228, number of events= 159 

                                             coef exp(coef)  se(coef)      z
age                                      0.017394  1.017546  0.009493  1.832
sex                                     -0.541317  0.581981  0.171290 -3.160
splines::ns(wt.loss, knots = c(1, 12))1 -0.314830  0.729913  0.483386 -0.651
splines::ns(wt.loss, knots = c(1, 12))2 -3.237030  0.039280  1.603606 -2.019
splines::ns(wt.loss, knots = c(1, 12))3 -1.264345  0.282424  0.930900 -1.358
                                        Pr(>|z|)   
age                                      0.06691 . 
sex                                      0.00158 **
splines::ns(wt.loss, knots = c(1, 12))1  0.51485   
splines::ns(wt.loss, knots = c(1, 12))2  0.04353 * 
splines::ns(wt.loss, knots = c(1, 12))3  0.17440   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                                        exp(coef) exp(-coef) lower .95
age                                       1.01755     0.9828  0.998789
sex                                       0.58198     1.7183  0.416013
splines::ns(wt.loss, knots = c(1, 12))1   0.72991     1.3700  0.283017
splines::ns(wt.loss, knots = c(1, 12))2   0.03928    25.4580  0.001695
splines::ns(wt.loss, knots = c(1, 12))3   0.28242     3.5408  0.045553
                                        upper .95
age                                        1.0367
sex                                        0.8142
splines::ns(wt.loss, knots = c(1, 12))1    1.8825
splines::ns(wt.loss, knots = c(1, 12))2    0.9103
splines::ns(wt.loss, knots = c(1, 12))3    1.7510

Concordance= 0.611  (se = 0.025 )
Likelihood ratio test= 19.33  on 5 df,   p=0.002
Wald test            = 18.66  on 5 df,   p=0.002
Score (logrank) test = 19.19  on 5 df,   p=0.002

We see that weight loss has three coefficients, one for the lowest value through 1 (1), one for 1 through 12 (2), and one for 12 through the highest value (3).
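
To make this concrete, we can inspect the basis matrix that ns() generates; each column corresponds to one of the three coefficients above:

# Inspect the three natural-spline basis columns for wt.loss
head(splines::ns(imputed.lung[["wt.loss"]], knots = c(1, 12)))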

Now, let’s formally test if wt.loss needs to be modelled non-linearly using natural cubic splines.

# Fit the linear prediction model
fit.rms.lung.linear <- survival::coxph(survival::Surv(time, status) ~
                                         age + sex +
                                         wt.loss,
                                       data = imputed.lung) 

# likelihood ratio test
anova(fit.rms.lung.linear, fit.rms.lung.nonlinear, test = "Chisq")
Analysis of Deviance Table
 Cox model: response is  survival::Surv(time, status)
 Model 1: ~ age + sex + wt.loss
 Model 2: ~ age + sex + splines::ns(wt.loss, knots = c(1, 12))
   loglik  Chisq Df Pr(>|Chi|)
1 -729.18                     
2 -727.36 3.6258  2     0.1632

This illustrates that weight loss does not need to be modelled non-linearly.

In the summary of our model, please note that we do not have an intercept. To calculate the final predicted risk of a Cox proportional hazards model, we need to use the baseline hazard instead of the intercept. However, the baseline hazard is different for each timepoint.

We can use basehaz() to calculate the cumulative baseline hazard at our timepoint of interest (2 years):

# Extract baseline hazard at 2 years
h0 <- dplyr::filter(survival::basehaz(fit.rms.lung.linear), 
                         time == 365.25 * 2)[["hazard"]]
print(h0)
[1] 2.137266

Discrimination of this model can be calculated using Harrell’s C-statistic, for example with:

* The function cindex() from the {dynpred} package (does not require the model fit, but does require the development formula).
* The function rcorr.cens() from the {Hmisc} package (does not require the model fit, but may be slow).
* The function cIndex() from the {intsurv} package.
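
When you do have the model fit object at hand, the {survival} package itself also reports Harrell’s C directly (a small addition to the list above):

# Harrell's C-statistic directly from the fitted Cox model
survival::concordance(fit.rms.lung.linear)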

The calibration slope can be calculated as before, using coxph(), and a calibration plot can also be drawn using the same methods as discussed before. In the case of competing risks, take note of validation measures specific to competing risk settings. More on that can be found in Van Geloven et al. 2021 and Ramspek et al. 2021.

5.6 The curse of the model fit object revisited

We already saw that with the model fit object, validation of a prediction model becomes quite easy.

Using predict()

If we use predict() to calculate the survival probability for an individual based on a Cox regression model (thus using predict(fit, type = "survival")), note that the survival probability is not the probability at our specified time horizon, but at the end of that individual's observed follow-up. Therefore, we cannot use this to calculate our predictions (see also this issue opened by the author of this practical, who turned out to be wrong but learned along the way).

To get individual predictions, instead use predict(fit, type = "lp") and calculate the survival probability from the linear predictor using \(\hat{y}(t) = 1 - S_0(t)^{e^{LP_i}}\), where \(S_0(t) = e^{-H_0(t)}\) and \(H_0(t)\) is the cumulative baseline hazard.

lp <- fit.rms.lung.linear[["linear.predictors"]]   # centered linear predictors
h <- h0 * exp(lp)                                  # cumulative hazard at 2 years
p <- 1 - exp(-h)                                   # predicted 2-year risk
hist(p, main = "Histogram of predicted probabilities")

However, in the absence of a model fit object, the validator may struggle more, especially for Cox regression models. By default, both coxph() and cph() estimate a centered Cox regression model. This means that the reference individual for the model is not someone without any binary predictor and with all continuous variables at 0, but someone without any binary predictor and with all continuous variables at the mean of that variable. If we then want to calculate an individual’s linear predictor, we also need the linear predictor of the sample (\(LP_{sample}\)). To calculate an individual’s linear predictor for the final prediction, we subtract \(LP_{sample}\) from their individual linear predictor \(LP_{individual}\): \(\hat{y}(t) = 1 - S_0(t)^{e^{LP_{individual} - LP_{sample}}}\). More on this can be found in this blogpost.
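
Putting this together, here is a sketch of a two-year prediction for one hypothetical individual using only pieces that could be published alongside the model: the coefficients, the covariate means used for centering, and the baseline hazard h0 at two years. For convenience we pull the means from the fit object, but a validator would take them from the publication; the predictor values are made up:

# Sketch: 2-year predicted risk for one hypothetical individual
b      <- coef(fit.rms.lung.linear)           # published coefficients
means  <- fit.rms.lung.linear[["means"]]      # centering values behind LP_sample
x.new  <- c(age = 60, sex = 2, wt.loss = 10)  # hypothetical individual
lp.new <- sum(b * (x.new - means))            # LP_individual - LP_sample
1 - exp(-h0 * exp(lp.new))                    # predicted risk at 2 years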

6 Final words

We have now taken a look at the basics of prediction modelling in R. Although many more packages are available for our goals (e.g. {riskRegression}), we introduced the two most commonly used packages. If you start developing or validating a prediction model on your own, look back at this practical and never be afraid to ask for help, whether from an experienced modeller, the internet, or (with due caution applied) a large language model.

7 Postscript

This practical was developed on:

R version 4.5.0 (2025-04-11 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Europe/Amsterdam
tzcode source: internal

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] rms_8.0-0      Hmisc_5.2-3    rio_1.2.4      survival_3.8-3 pROC_1.19.0.1 
[6] mice_3.18.0    ggExtra_0.11.0 ggplot2_3.5.2  dplyr_1.1.4   

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1   farver_2.1.2       R.utils_2.13.0     fastmap_1.2.0     
 [5] TH.data_1.1-3      pacman_0.5.1       promises_1.3.3     digest_0.6.37     
 [9] rpart_4.1.24       mime_0.13          lifecycle_1.0.4    cluster_2.1.8.1   
[13] magrittr_2.0.3     compiler_4.5.0     rlang_1.1.6        tools_4.5.0       
[17] yaml_2.3.10        data.table_1.17.2  knitr_1.50         labeling_0.4.3    
[21] htmlwidgets_1.6.4  RColorBrewer_1.1-3 multcomp_1.4-28    miniUI_0.1.2      
[25] polspline_1.1.25   withr_3.0.2        foreign_0.8-90     purrr_1.1.0       
[29] R.oo_1.27.1        nnet_7.3-20        grid_4.5.0         jomo_2.7-6        
[33] xtable_1.8-4       colorspace_2.1-1   scales_1.4.0       iterators_1.0.14  
[37] MASS_7.3-65        mvtnorm_1.3-3      cli_3.6.5          rmarkdown_2.29    
[41] reformulas_0.4.1   generics_0.1.4     rstudioapi_0.17.1  minqa_1.2.8       
[45] stringr_1.5.1      base64enc_0.1-3    vctrs_0.6.5        boot_1.3-31       
[49] glmnet_4.1-9       Matrix_1.7-3       sandwich_3.1-1     jsonlite_2.0.0    
[53] SparseM_1.84-2     mitml_0.4-5        Formula_1.2-5      htmlTable_2.4.3   
[57] foreach_1.5.2      tidyr_1.3.1        glue_1.8.0         nloptr_2.2.1      
[61] pan_1.9            codetools_0.2-20   stringi_1.8.7      shape_1.4.6.1     
[65] gtable_0.3.6       later_1.4.4        lme4_1.1-37        tibble_3.2.1      
[69] pillar_1.10.2      htmltools_0.5.8.1  quantreg_6.1       R6_2.6.1          
[73] Rdpack_2.6.4       evaluate_1.0.3     shiny_1.11.1       lattice_0.22-6    
[77] R.methodsS3_1.8.2  rbibutils_2.3      backports_1.5.0    broom_1.0.8       
[81] httpuv_1.6.16      MatrixModels_0.5-4 Rcpp_1.0.14        gridExtra_2.3     
[85] nlme_3.1-168       checkmate_2.3.2    mgcv_1.9-1         xfun_0.52         
[89] zoo_1.8-14         pkgconfig_2.0.3