PREDICT - Prediction Practical 2

Shrinkage Methods and Performance Metrics for Prediction Models

Author

Roemer J. Janse & Carolien C.H.M. Maas

Published

December 8, 2025

1 Introduction

In this practical, we will see how to implement different shrinkage methods, specifically ridge regression, LASSO regression, and elastic net regression. Additionally, we will look at uniform shrinkage using bootstrapping. In this practical you will get more familiar with obtaining performance metrics (i.e., calibration, discrimination, and net benefit).

At the end of this practical, you will know how you can estimate the regularization parameter (\(\lambda\)) and use it to perform these shrinkage-based regressions, as well as perform uniform shrinkage.

2 Set-up

2.1 Packages

As usual, before we get going, we will load some useful packages.

# Loading packages
pacman::p_load("dplyr",         # Data wrangling
               "magrittr",      # More efficient pipelines
               "tidyr",         # Data tidying
               "rio",           # Importing data
               "glmnet",        # Shrinkage methods
               "rms",           # Modelling functions
               "pROC",          # C-statistics
               "knitr",         # Markdown functions
               "ggplot2",       # Data visualization
               "patchwork"      # Patching plots together
)

2.2 Loading data

For this practical, we will again use the Traumatic brain injury (TBI) data. This dataset contains 2,159 patients from the international and US Tirilazad trials (distributed here for didactic purposes only). The primary outcome was the Glasgow Outcome (range 1 through 5) at 6 months.

The TBI dataset was sent to you together with this practical and can be loaded from your local device:

# Specify path of TBI dataset
path <- "C:/users/avid_PINT_student/documents/pint/TBI.txt"

# Load TBI data
tbi <- import(path)

Warning in (function (input = "", file = NULL, text = NULL, cmd = NULL, :
Detected 24 column names but the data has 25 columns (i.e. invalid file). Added
an extra default column name for the first column which is guessed to be row
names or an index. Use setnames() afterwards if this guess is not correct, or
fix the file write command that created the file to create a valid file.

The warning we receive is the result of row numbers being present in the .txt file (as also guessed by import()). Given that import() guessed correct, we can ignore this warning and treat the extra column V1 as individual identifiers.

Below is the codebook for the TBI data.

Codebook for the TBI dataset
	TBI
Variable	Description
V1	Identifier
trial	Trial identification
d.gos	Glasgow Outcome Scale at 6 months dead vegetative severe disability moderate disability good recovery
d.mort	Mortality at 6 months
d.unfav	Unfavourable outcome at 6 months
cause	Cause of injury road traffic accident motorbike assault domestic/fall other
age	Age (yrs)
d.motor	Admission motor score (range 1-6)
d.pupil	Pupillary reactivity both reactive one reactive none reactive
pupil.i	Single imputed pupillary reactivity
hypoxia	Hypoxia before or at admission
hypotens	Hypotension before or at admission
ctclass	Marshall CT classification (range 1-6)
tsah	tSAH at CT
edh	EDH at CT
cisterns	Compressed cisterns at CT no slightly fully
shift	Midline shift >5 mm at CT
d.sysbpt	Systolic blood pressure at admission (mmHg)
glucose	Glucose at admission (mmol/L)
glucoset	Truncated glucose values (mmol/L)
ph	pH
sodium	Sodium (mmol/L)
hb	Hemoglobin (g/dL)
hbt	Truncated hemoglobin (g/dL)

2.3 Preparing our data

Like last time, we will develop a prediction model for the risk of an unfavourable outcome after a TBI. However, we will geographically split the data into a development dataset (individuals from the Tirilazad US trial) and a validation dataset (individuals from the Tirilazad International trial). Additionally, although not recommended, for educational simplicity, we have again used single imputation for missing data.

# Prior to split, set categorical variables to factors
tbi$pupil.i <- as.factor(tbi$pupil.i)
tbi$cause <- as.factor(tbi$cause)

# Create development dataset
development <- filter(tbi, trial == "Tirilazad US")

# Create validation dataset
validation <- filter(tbi, trial == "Tirilazad International")

2.4 Getting to know our data

To get to know our data, we will look at the number of cases in both datasets.

# Counts of non-cases and cases
by(tbi[["d.unfav"]], tbi[["trial"]], table)

tbi[["trial"]]: Tirilazad International

  0   1 
662 456 
------------------------------------------------------------ 
tbi[["trial"]]: Tirilazad US

  0   1 
646 395

# Distribution of non-cases and cases (as %)
by(tbi[["d.unfav"]], tbi[["trial"]], \(x) proportions(table(x)) * 100)

tbi[["trial"]]: Tirilazad International
x
       0        1 
59.21288 40.78712 
------------------------------------------------------------ 
tbi[["trial"]]: Tirilazad US
x
       0        1 
62.05572 37.94428

In the development dataset, we can see that we have quite some cases (n = 395, 37.9%). We have a bit more in the validation dataset (n = 456, 40.8%).

We consulted TBI experts, who told us that the most important predictors we want to include are pupillary reactivity (pupil.i), TBI cause (cause), systolic blood pressure (d.sysbpt), and age (age). We now use the pupil.i variable that is obtained using single impuation. We check if the continuous variables age and d.sysdpt need to be modelled non-linearly This would mean that we have used are a total of 12 parameters, even though we conclude to use linear age and model dsysbpt using a piecewise linear transformation.

# Model using imputed pupil information
dd <- rms::datadist(development)
options(datadist="dd")
non_linear_model <- rms::lrm(
  d.unfav ~                      
    rms::rcs(age, 4) + 
    rms::rcs(d.sysbpt, 4) + 
    pupil.i +
    cause,
  data = development)      

# Indicates non-linearity for systolic blood pressure
anova(non_linear_model)

                Wald Statistics          Response: d.unfav 

 Factor          Chi-Square d.f. P     
 age              38.62      3   <.0001
  Nonlinear        3.00      2   0.2227
 d.sysbpt         24.19      3   <.0001
  Nonlinear       23.01      2   <.0001
 pupil.i         107.79      2   <.0001
 cause             3.93      4   0.4156
 TOTAL NONLINEAR  25.02      4   <.0001
 TOTAL           160.69     12   <.0001

plot(Predict(non_linear_model))

If we were to calculate an events-per-variable (EPV), this would be:

# Number of coefficients used
summary(non_linear_model)

             Effects              Response : d.unfav 

 Factor                                      Low    High   Diff. Effect   
 age                                          23.00  41.00 18.00  0.884730
  Odds Ratio                                  23.00  41.00 18.00  2.422300
 d.sysbpt                                    123.17 143.13 19.96  0.184420
  Odds Ratio                                 123.17 143.13 19.96  1.202500
 pupil.i - no reactive pupils:both reactive    1.00   2.00    NA  1.831500
  Odds Ratio                                   1.00   2.00    NA  6.243400
 pupil.i - one reactive:both reactive          1.00   3.00    NA  0.866100
  Odds Ratio                                   1.00   3.00    NA  2.377600
 cause - Assault:Road traffic accident         5.00   1.00    NA -0.251800
  Odds Ratio                                   5.00   1.00    NA  0.777400
 cause - domestic/fall:Road traffic accident   5.00   2.00    NA  0.072995
  Odds Ratio                                   5.00   2.00    NA  1.075700
 cause - Motorbike:Road traffic accident       5.00   3.00    NA -0.215150
  Odds Ratio                                   5.00   3.00    NA  0.806420
 cause - other:Road traffic accident           5.00   4.00    NA  0.206520
  Odds Ratio                                   5.00   4.00    NA  1.229400
 S.E.    Lower 0.95 Upper 0.95
 0.20000  0.49274   1.27670   
      NA  1.63680   3.58490   
 0.19087 -0.18967   0.55851   
      NA  0.82723   1.74810   
 0.17951  1.47970   2.18340   
      NA  4.39150   8.87610   
 0.21213  0.45033   1.28190   
      NA  1.56880   3.60340   
 0.26283 -0.76694   0.26335   
      NA  0.46443   1.30130   
 0.22052 -0.35922   0.50521   
      NA  0.69822   1.65730   
 0.23124 -0.66837   0.23806   
      NA  0.51255   1.26880   
 0.20083 -0.18709   0.60014   
      NA  0.82937   1.82240

nr_coef <- length(coef(non_linear_model)[-1])
cat("Number of coefficients (excluding intercept):", nr_coef, "\n")

Number of coefficients (excluding intercept): 12

# EPV in both datasets
by(tbi[["d.unfav"]], tbi[["trial"]], \(x) table(x)[["1"]] / nr_coef)

tbi[["trial"]]: Tirilazad International
[1] 38
------------------------------------------------------------ 
tbi[["trial"]]: Tirilazad US
[1] 32.91667

This is not bad, but EPV is also not an ideal measure. More formal sample size calculations are available but are outside the scope of this course.

To investigate the best functional form for systolic blood pressure, let us try a couple models and compare the performance using the BIC (read more here) and look at the effect plot.

# restricted cubic splines
model_1 <- rms::lrm(
  d.unfav ~                      
    age + 
    rms::rcs(d.sysbpt, 4) + 
    pupil.i +
    cause,
  data = development)  
# quadratic 
model_2 <- rms::lrm(
  d.unfav ~                      
    age + 
    pol(d.sysbpt, 2) + 
    pupil.i +
    cause,
  data = development)  
# linear piecewise regression
model_3 <- rms::lrm(
  d.unfav ~                      
    age + 
    lsp(d.sysbpt, 125) + 
    pupil.i +
    cause,
  data = development)

# compare performance
cat("Model performance of model 1:", BIC(model_1), "\n",
    "Model performance of model 2:", BIC(model_2), "\n",
    "Model performance of model 3:", BIC(model_3), "\n")

Model performance of model 1: 1250.81 
 Model performance of model 2: 1244.537 
 Model performance of model 3: 1244.47

# look at effect plots
plot(Predict(model_1))

plot(Predict(model_2))

plot(Predict(model_3))

As the results show, the model performs similarly across specifications. Because we prefer not to impose a specific value for the location of the kink, we will instead use a quadratic term to allow the relationship to vary smoothly.

3 Ridge, LASSO, and elastic net

Please have a look at this post for a visual presentation of the three regularization methods.

3.1 Standard model

To show how shrinkage affects the model, we will first develop a standard logistic prediction model.

# Develop the prediction model
fit_standard <- glm(d.unfav ~
                      age +
                      pol(d.sysbpt, 2) +
                      pupil.i +
                      cause,
                    data = development,
                    family = binomial)     

# See output
summary(fit_standard)


Call:
glm(formula = d.unfav ~ age + pol(d.sysbpt, 2) + pupil.i + cause, 
    family = binomial, data = development)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                15.9992904  3.9739447   4.026 5.67e-05 ***
age                         0.0362897  0.0060717   5.977 2.27e-09 ***
pol(d.sysbpt, 2)d.sysbpt   -0.2751360  0.0596979  -4.609 4.05e-06 ***
pol(d.sysbpt, 2)d.sysbpt^2  0.0010128  0.0002239   4.524 6.06e-06 ***
pupil.ino reactive pupils   1.8228283  0.1791413  10.175  < 2e-16 ***
pupil.ione reactive         0.8350949  0.2109528   3.959 7.54e-05 ***
causedomestic/fall          0.2918318  0.2967658   0.983    0.325    
causeMotorbike              0.0020092  0.3120119   0.006    0.995    
causeother                  0.4525598  0.2865500   1.579    0.114    
causeRoad traffic accident  0.2422338  0.2597648   0.933    0.351    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1382.0  on 1040  degrees of freedom
Residual deviance: 1175.1  on 1031  degrees of freedom
AIC: 1195.1

Number of Fisher Scoring iterations: 4

3.2 Ridge regression

Ridge regression, also known as \(l_2\) regularization, applies a penalty to the maximum likelihood estimator, leading to shrinkage of the regression coefficients, which results in a less optimistic model. In the below formula, the first part shows the maximum likelihood estimator of a logistic model and the second (blue) part the addition of the \(l_2\) regularization:

\[ \log{L} = \sum_iy_i\log{\pi_i}+(1-y_i)\log{(1-\pi_i)}-\color{blue}{\lambda\sum_{j=1}^P\hat{\beta}_j^2} \]

where \(i\) stands for individual \(i\), \(y_i\) is the observed outcome for individual \(i\) and \(\pi_i\) is the predicted outcome for individual i.

We can fit a ridge regression using glmnet() from the {glmnet} package. However, first we require a value for \(\lambda\). The preferred way to determine this is by \(k\)-fold cross-validation (read more here), as implemented in the function cv.glmnet()from {glmnet} package:

1# Prepare predictors
2x_dev <- model.matrix(
  ~ age + pol(d.sysbpt, 2) + pupil.i + cause,                     
  data = dplyr::select(development, age, d.sysbpt, pupil.i, cause)
)
3x_dev <- x_dev[, -1]

4# Prepare outcome
y_dev <- as.matrix(development[, "d.unfav"])

# Define range of lambdas 
5lambdas <- seq(0.001, 10, length.out = 3e4)

# Run cross-validation (this might take a minute)
cv_ridge <- glmnet::cv.glmnet(x = x_dev, 
                              y = y_dev, 
                              family = "binomial",
6                              type.measure = "mse",
                              lambda = lambdas, 
7                              alpha = 0,
8                              nfolds = 10)

# Get optimal lambda
cat(" lambda.min      :", cv_ridge[["lambda.min"]], "\n",
    "log(lambda.min) :", log(cv_ridge[["lambda.min"]]), "\n",
    "lambda.1se      :", cv_ridge[["lambda.1se"]], "\n",
    "log(lambda.1se) :", log(cv_ridge[["lambda.1se"]]), "\n")
9lambda_ridge <- cv_ridge[["lambda.1se"]]

10# Show results of cross-validation
plot(cv_ridge)

1: The function cv.glmnet() requires a matrix with the predictors, where each row corresponds to an observation and each column a predictor.
2: Factors are not allowed in the x matrix, which means that we should already create dummy variables for the factors. The function model.matrix() does this for us.
3: Because glmnet automatically includes an intercept, we remove the first column to not generate an additional intercept.
4: The outcome should also be given as a matrix or vector.
5: We specify our own range of lambda’s we would like to test using cross-validation. Note that glmnet() can also do this on our own, but this may be unreliable.
6: We selected the model based on mean squared error, but other performance metrics can be used as well. See the documentation with ?cv.glmnet for details.
7: We specify alpha = 0 to indicate that we want to use ridge regression. If we use alpha = 1, we indicate LASSO regression. Anything in between would result in elastic net regression.
8: We specify 10-fold cross-validation.
9: The optimal \(\lambda\) is derived by taking lambda.1se from the results, alternatively we could extract lambda.min. Use lambda.min when you want the model with best predictive accuracy. Use lambda.1se when you prefer a simpler, more regularized model that is often more robust. Read more here.
10: We can show that the \(\lambda\) we selected (the dashed line) is the one with the lowest mean squared error. The x-axis is in log scale for clarity.

 lambda.min      : 0.001 
 log(lambda.min) : -6.907755 
 lambda.1se      : 0.01533238 
 log(lambda.1se) : -4.177788

We choose our optimal \(\lambda\) value to be:

lambda_ridge

[1] 0.01533238

Now we can fit our ridge regression model:

# Fit ridge regression
fit_ridge <- glmnet::glmnet(x = x_dev,
                            y = y_dev,
                            family = "binomial",
                            alpha = 0,
                            lambda = lambda_ridge)
                    
# Compare coefficients
data.frame(standard = fit_standard[["coefficients"]],
           ridge = as.numeric(coef(fit_ridge)))

                               standard         ridge
(Intercept)                15.999290431 -6.485339e-01
age                         0.036289668  3.364532e-02
pol(d.sysbpt, 2)d.sysbpt   -0.275136026 -1.699942e-02
pol(d.sysbpt, 2)d.sysbpt^2  0.001012795  3.721493e-05
pupil.ino reactive pupils   1.822828253  1.683370e+00
pupil.ione reactive         0.835094883  7.489008e-01
causedomestic/fall          0.291831751  2.194116e-01
causeMotorbike              0.002009155 -1.027015e-01
causeother                  0.452559793  3.732177e-01
causeRoad traffic accident  0.242233817  1.666817e-01

Comparing these coefficients to the standard model, we may observe that the coefficients are shrunk. If you have a polychotomous predictor (i.e., categorical predictor with more than two levels), not all coefficients for that predictor might shrink.

If we set \(\lambda\) to 0, we can see that it will be practically the same as the standard logistic regression:

# Fit ridge regression with lambda 0
fit_glmnet <- glmnet::glmnet(x = x_dev,
                             y = y_dev,
                             family = "binomial",
                             alpha = 0,
                             lambda = 0)

# Compare coefficients
data.frame(glm = fit_standard[["coefficients"]],
           glmnet = as.numeric(coef(fit_glmnet)))

                                    glm        glmnet
(Intercept)                15.999290431 15.6024704013
age                         0.036289668  0.0362930316
pol(d.sysbpt, 2)d.sysbpt   -0.275136026 -0.2691319203
pol(d.sysbpt, 2)d.sysbpt^2  0.001012795  0.0009903516
pupil.ino reactive pupils   1.822828253  1.8219674843
pupil.ione reactive         0.835094883  0.8342604853
causedomestic/fall          0.291831751  0.2920420943
causeMotorbike              0.002009155  0.0016114572
causeother                  0.452559793  0.4532842621
causeRoad traffic accident  0.242233817  0.2428490747

Note

Note that the regression coefficients are not exactly the same. This might be because glm() from {stats} and glmnet() from {glmnet} have implemented the estimation of the regression coefficients differently.

3.3 LASSO regression

The least absolute shrinkage and selection operator (LASSO) regression functions similar to ridge regression, except that it is able to shrink coefficients to 0, leading to removal of those predictors from the prediction model. It applies the \(l_1\) penalty and is expressed in the red part as:

\[ \log{L} = \sum_iy_i\log{\pi_i}+(1-y_i)\log{(1-\pi_i)}-\color{red}{\lambda\sum_{j=1}^{P}\lvert\hat{\beta}_j\lvert} \]

# Run cross-validation
cv_lasso <- glmnet::cv.glmnet(x = x_dev, 
                              y = y_dev, 
                              family = "binomial",
                              type.measure = "mse",
                              lambda = lambdas,
1                              alpha = 1,
                              nfolds = 10) 

# Show results of cross-validation
plot(cv_lasso)

1: Now we specify alpha = 1, indicating LASSO regression.

# Get optimal lambda
lambda_lasso <- cv_lasso[["lambda.1se"]]

Our optimal \(\lambda\) value is now:

lambda_lasso

[1] 0.03866416

Using our newly estimated \(\lambda\), we can fit a LASSO regression:

# Fit LASSO regression
fit_lasso <- glmnet::glmnet(x = x_dev,
                            y = y_dev,
                            family = "binomial",
                            alpha = 1,
                            lambda = lambda_lasso)
                    
# Compare coefficients
data.frame(standard = fit_standard[["coefficients"]],
           ridge = as.numeric(coef(fit_ridge)),
           lasso = as.numeric(coef(fit_lasso)))

                               standard         ridge       lasso
(Intercept)                15.999290431 -6.485339e-01 -1.50972408
age                         0.036289668  3.364532e-02  0.02164983
pol(d.sysbpt, 2)d.sysbpt   -0.275136026 -1.699942e-02  0.00000000
pol(d.sysbpt, 2)d.sysbpt^2  0.001012795  3.721493e-05  0.00000000
pupil.ino reactive pupils   1.822828253  1.683370e+00  1.26140256
pupil.ione reactive         0.835094883  7.489008e-01  0.19501530
causedomestic/fall          0.291831751  2.194116e-01  0.00000000
causeMotorbike              0.002009155 -1.027015e-01  0.00000000
causeother                  0.452559793  3.732177e-01  0.00000000
causeRoad traffic accident  0.242233817  1.666817e-01  0.00000000

From this output, we can see that two levels of cause were set to 0, effectively removing those variables from the final prediction model.

3.4 Elastic net regression

A special case is elastic net regression, where we combine the \(l_1\) and \(l_2\) penalties, or in other words blend ridge and LASSO regression. This takes the advantages of both approaches. A logistic elastic net model is expressed as

\[ \log{L} = \sum_iy_i\log{\pi_i}+(1-y_i)\log{(1-\pi_i)}- \color{green}{\lambda}\left( \color{red}{\alpha\sum_{j=1}^{P}\lvert\hat{\beta}_j\rvert} \color{black}{+} \color{blue}{\frac{1-\alpha}{2}\sum_{p=1}^P\hat{\beta}_p^2}\right) \]

We can see that \(\lambda\) applies to both penalties at the same time and that \(\alpha\) determines to what extent each penalty is applied. For instance, we can perform cross validation and an elastic net regression with an alpha of 0.5 as follows:

# Run cross-validation
cv_elnet <- glmnet::cv.glmnet(x = x_dev, 
                              y = y_dev, 
                              family = "binomial",
                              type.measure = "mse",
                              lambda = lambdas,
                              alpha = 0.5,
                              nfolds = 10) 

# Show results of cross-validation
plot(cv_elnet)

# Get optimal lambda
lambda_elnet <- cv_elnet[["lambda.1se"]]

Our \(\lambda\) is:

lambda_elnet

[1] 0.04999673

# Fit elastic net regression
fit_elnet <- glmnet::glmnet(x = x_dev,
                            y = y_dev,
                            family = "binomial",
                            alpha = 0.5,
                            lambda = lambda_elnet)
                    
# Compare coefficients
data.frame(standard = fit_standard[["coefficients"]],
           ridge = as.numeric(coef(fit_ridge)),
           lasso = as.numeric(coef(fit_lasso)),
           `elastic net` = as.numeric(coef(fit_elnet)))

                               standard         ridge       lasso  elastic.net
(Intercept)                15.999290431 -6.485339e-01 -1.50972408 -1.366747650
age                         0.036289668  3.364532e-02  0.02164983  0.023879622
pol(d.sysbpt, 2)d.sysbpt   -0.275136026 -1.699942e-02  0.00000000 -0.001830626
pol(d.sysbpt, 2)d.sysbpt^2  0.001012795  3.721493e-05  0.00000000  0.000000000
pupil.ino reactive pupils   1.822828253  1.683370e+00  1.26140256  1.291207172
pupil.ione reactive         0.835094883  7.489008e-01  0.19501530  0.344606628
causedomestic/fall          0.291831751  2.194116e-01  0.00000000  0.000000000
causeMotorbike              0.002009155 -1.027015e-01  0.00000000  0.000000000
causeother                  0.452559793  3.732177e-01  0.00000000  0.000000000
causeRoad traffic accident  0.242233817  1.666817e-01  0.00000000  0.000000000

Note

Note that when we estimate a \(\lambda\) for a model, we cannot automatically use that same \(\lambda\) for the same prediction model with a different shrinkage method. For instance, if I estimate a \(\lambda\) for a ridge regression model with a set of predictors \(X\) and outcome \(Y\), I will need to reestimate \(\lambda\) if I want to fit predictors \(X\) for outcome \(Y\) with a LASSO regression model.

3.5 Performance

3.5.1 Set-up for calculating validation measures

Now that we fit our standard and shrunk models, we can see how they compare in performance, especially in an external validation set.

We will first create a function to calculate the C-statistic for our models:

# Function to calculate C-statistic for output of a glmnet
1cstat <- function(fit,
2                  x = NULL,
3                  y = NULL,
4                  df = NULL) {
  if ("glmnet" %in% class(fit)) {
    # Calculate predictions for glmnet output
    preds <- predict(fit, newx = x, type = "response")
  } else {
    # Calculate predictions for glm output
    preds <- predict(fit, newdata = df, type = "response")
  }
  
  # Calculate C-statistic
  AUC <- pROC::roc(
    response = as.numeric(y),
    predictor = as.numeric(preds),
    levels = c(0, 1),
    direction = "<"
  )[["auc"]][[1]]
  
  # Return c-statistic
  return(AUC)
}

1: fit is the resulting fit from glm() or glmnet()
2: x is covariate matrix of the data to validate (in case of glmnet)
3: y is outcome matrix of the data to validate (in case of glmnet)
4: df is data frame including both the outcome and predictors (in case of glm)

We do not need to create these functions. However, to keep our script legible and more efficient, we pooled together the calculation of C-statistics for glm and glmnet objects. Simply, it calculates the predictions based on whether the model was fitted using glm() or glmnet() and then uses the roc() function from {pROC} to calculate the C-statistic and return it.

We will also need to create the x and y matrices of the validation data:

# Prepare predictors                            
x_val <- model.matrix(                                               
  ~ age + pol(d.sysbpt, 2) + pupil.i + cause,
  data = dplyr::select(validation, age, d.sysbpt, pupil.i, cause)
)
x_val <- x_val[, -1]

# Prepare outcome                                
y_val <- as.matrix(validation[, "d.unfav"])

Moreover, we will create a function to calculate the calibration intercept and slope for all models:

# Function to calculate slope
calmetrics <- function(fit,
                       x = NULL,
                       y = NULL,
                       df = NULL) {
  if ("glmnet" %in% class(fit)) {
    # Calculate linear predictor for glmnet output
    lp <- predict(fit, newx = x, type = "link")
  } else{
    # Calculate linear predictor for glm output
    lp <- predict(fit, newdata = df, type="link")
  }
  
  # Calculate calibration
  fit_int <- glm(y ~ offset(lp), family = binomial)
  fit_slope <- glm(y ~ lp, family = binomial)
  
  # Get intercept and slope
  cal_int <- round(fit_int[["coefficients"]][[1]], 3)
  cal_slope <- round(fit_slope[["coefficients"]][[2]], 3)
  
  # Return results
  return(list(cal_int = cal_int, cal_slope = cal_slope))
}

3.5.2 Obtain predictions for the validation en development data

# Function to calculate predictions
preds <- function(fit, x, y, df){
    if ("glmnet" %in% class(fit)){
      # Calculate predictions for glmnet output
      preds <- predict(fit, newx = x, type = "response")
    } else {
      # Calculate predictions for glm output
      preds <- predict(fit, newdata = df, type = "response")
    }
    
    # Return predictions
    return(preds)
}

We will put all predictions in a single data frame which we will need to draw our calibration plots for the development and validation data using all the different models:

## Create data tibble for each model
pred_dt <- c()
models <- c("standard", "ridge", "lasso", "elnet")
1for (model in models){
  # extract fit
2  fit <- eval(parse(text=paste0("fit_", model)))
  
3  # attach to data frame
  pred_dev <- preds(fit, x = x_dev, y = y_dev, df = development)
  pred_val <- preds(fit, x = x_val, y = y_val, df = validation)

  # create table
  dat <- tibble(
4    cohort = c(rep("Development", nrow(development)),
               rep("Validation", nrow(validation))),
    preds = c(pred_dev, pred_val), 
5    obs = c(y_dev, y_val)
  ) |>
6    mutate(model = model)
  
  # attach to prediction data frame
7  pred_dt <- rbind(pred_dt, dat)
}

1: Loop over all glm() and glmnet() models.
2: At each step of the loop, extract the fit.
3: Obtain predictions for each model.
4: Save predictions for development and validation into one data frame.
5: Save outcomes for development and validation into one data frame.
6: Add a column to indicate which model was used to obtain the predictions.
7: For each model, stack the results to obtain one data frame with all predictions using all models.

# edit final data frame
model_names <- c("Standard", "Ridge", "LASSO", "Elastic net")
pred_dt <- pred_dt |>
  mutate(
1    model = dplyr::recode(
      model,
      standard = "Standard",
      ridge = "Ridge",
      lasso = "LASSO",
      elnet = "Elastic net"
    ),
2    model = factor(model, levels = model_names)
  )
print(pred_dt)

1: Recode the strings used in the column model.
2: Define model as a factor.

# A tibble: 8,636 × 4
   cohort      preds   obs model   
   <chr>       <dbl> <int> <fct>   
 1 Development 0.203     1 Standard
 2 Development 0.162     0 Standard
 3 Development 0.135     0 Standard
 4 Development 0.151     0 Standard
 5 Development 0.866     1 Standard
 6 Development 0.538     1 Standard
 7 Development 0.167     0 Standard
 8 Development 0.539     1 Standard
 9 Development 0.122     0 Standard
10 Development 0.148     0 Standard
# ℹ 8,626 more rows

3.5.3 Validate models

Now we can determine how the different models performed.

# Calculate C-statistics
results <- tibble(model = rep(model_names, 2),
                  cohort = c(rep("Development", 4), rep("Validation", 4)),
                  cstat = c(cstat(fit_standard, x = x_dev, y = y_dev, df = development),
                            cstat(fit_ridge, x = x_dev, y = y_dev, df = development), 
                            cstat(fit_lasso, x = x_dev, y = y_dev, df = development), 
                            cstat(fit_elnet, x = x_dev, y = y_dev, df = development),
                            cstat(fit_standard, x = x_val, y = y_val, df = validation),
                            cstat(fit_ridge, x = x_val, y = y_val, df = validation), 
                            cstat(fit_lasso, x = x_val, y = y_val, df = validation), 
                            cstat(fit_elnet, x = x_val, y = y_val, df = validation)))
print(results)

# A tibble: 8 × 3
  model       cohort      cstat
  <chr>       <chr>       <dbl>
1 Standard    Development 0.751
2 Ridge       Development 0.739
3 LASSO       Development 0.728
4 Elastic net Development 0.732
5 Standard    Validation  0.707
6 Ridge       Validation  0.702
7 LASSO       Validation  0.688
8 Elastic net Validation  0.695

Let’s also look at the calibration slopes

# Calculate calibration slopes
results$cal_int <- c(calmetrics(fit_standard, x = x_dev, y = y_dev, df = development)$cal_int, 
                     calmetrics(fit_ridge, x = x_dev, y = y_dev, df = development)$cal_int, 
                     calmetrics(fit_lasso, x = x_dev, y = y_dev, df = development)$cal_int, 
                     calmetrics(fit_elnet, x = x_dev, y = y_dev, df = development)$cal_int,
                     calmetrics(fit_standard, x = x_val, y = y_val, df = validation)$cal_int, 
                     calmetrics(fit_ridge, x = x_val, y = y_val, df = validation)$cal_int, 
                     calmetrics(fit_lasso, x = x_val, y = y_val, df = validation)$cal_int, 
                     calmetrics(fit_elnet, x = x_val, y = y_val, df = validation)$cal_int)
results$cal_slope <- c(calmetrics(fit_standard, x = x_dev, y = y_dev, df = development)$cal_slope, 
                       calmetrics(fit_ridge, x = x_dev, y = y_dev, df = development)$cal_slope, 
                       calmetrics(fit_lasso, x = x_dev, y = y_dev, df = development)$cal_slope, 
                       calmetrics(fit_elnet, x = x_dev, y = y_dev, df = development)$cal_slope,
                       calmetrics(fit_standard, x = x_val, y = y_val, df = validation)$cal_slope, 
                       calmetrics(fit_ridge, x = x_val, y = y_val, df = validation)$cal_slope, 
                       calmetrics(fit_lasso, x = x_val, y = y_val, df = validation)$cal_slope, 
                       calmetrics(fit_elnet, x = x_val, y = y_val, df = validation)$cal_slope)
print(results)

# A tibble: 8 × 5
  model       cohort      cstat cal_int cal_slope
  <chr>       <chr>       <dbl>   <dbl>     <dbl>
1 Standard    Development 0.751   0         1    
2 Ridge       Development 0.739   0         1.10 
3 LASSO       Development 0.728   0         1.50 
4 Elastic net Development 0.732   0         1.48 
5 Standard    Validation  0.707   0.205     0.812
6 Ridge       Validation  0.702   0.225     0.964
7 LASSO       Validation  0.688   0.216     1.40 
8 Elastic net Validation  0.695   0.206     1.38

This shows that the discriminative ability is slightly lower in the validation data compared to the development data, which may be due to poor model fit or a less heterogeneous case-mix in the validation data compared to the development data. The calibration intercept is perfectly zero in the development set for all models, whereas all models on average underestimate in the validation data. Only the calibration slope of the standard model is one in the development set, but greater than one for the penalized models because we artifically force the coefficients cause underfitting in the development data. As a result, in the validation set the standard model is overfitting, because the coefficients of the standard model were too extreme, whereas the penalized models are underfitting.

We can also look at the calibration plots:

# Create calibration plots for each model
ggplot2::ggplot(pred_dt, mapping = ggplot2::aes(x = preds, y = obs, colour = cohort)) +
  # Geometries
  ggplot2::geom_abline(linewidth = 1,
                       colour = "black",
                       alpha = 0.33) +
  ggplot2::geom_point(alpha = 0.33) +
  ggplot2::geom_smooth(
    colour = "#785EF0",
    fill = "#785EF0",
    method = "loess",
    formula = "y ~ x"
  ) +
  # Scaling
  ggplot2::scale_colour_manual(values = c("#648FFF", "#FFB000"),
                               aesthetics = c("fill", "colour")) +
  # Labels
  ggplot2::xlab("Predicted probability") +
  ggplot2::ylab("Observed probability") +
  # Transformations
  ggplot2::coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
  ggplot2::facet_grid(rows = vars(model), cols = vars(cohort)) +
  # Aesthetics
  ggplot2::theme_bw() +
  ggplot2::theme(legend.title = element_blank(), legend.position = "bottom")

What we see is that (although small for the C-statistics), the standard model performs better in the development data, but worse in the validation data. Because the improvement is only minimal, the calibration plots do not show much difference. Note that this does not have to be the case: van Calster et al.. However, it is good practice to show the calibration plots to see how calibration varies across the range of predicted risk, and not only summarized into measures such as the calibration intercept and slope.

4 Bootstrap-based uniform shrinkage

The shrinkage methods we just saw work by adding a penalty to the maximum likelihood estimator. A different approach is bootstrap-based uniform shrinkage, which calculates a shrinkage factor and uses this to penalize the estimated regression coefficients (instead of penalizing the estimator that estimates the coefficients). This is also the method used in the {rms} package as seen in the first practical.

4.1 Development

In short, we develop the prediction model on our full development data. Then, we bootstrap the data B times and redevelop the model in each bootstrap sample b. Each redeveloped model is applied to the development data and a calibration slope is calculated. Then, the shrinkage factor s equals the average calibration slope among all bootstrap samples. The final regression coefficients are then calculated as \(\hat{\beta}_{j,shrunk} = \hat{\beta}_j \cdot s\). Finally, we recalibrate the model intercept.

4.1.1 Manually coded bootstrapping

We already developed our standard model. For the bootstrapping, we can do:

# set seed for reproducibility 
set.seed(1)

# Specify number of bootstraps
b <- 500

# Start bootstrapping procedure
1s <- sapply(1:b, \(x){
    # Set seed based on bootstrap iteration
2    set.seed(x)
    
    # Randomly sample data with replacement
    dat_smp <- slice_sample(development,  
3                            n = nrow(development),
4                            replace = TRUE)
    
    # Redevelop model
    fit_smp <- glm(d.unfav ~                                        
                       age +     
                       d.sysbpt +
                       pupil.i +
                       cause,  
5                   data = dat_smp, family = binomial)
    
    # Get linear predictors
6    development[["lps"]] <- predict(fit_smp, newdata = development)
    
    # Calculate calibration slope
    slope <- glm(d.unfav ~ lps, 
                 data = development,
7                 family = binomial)[["coefficients"]][["lps"]]
    
    # Return slope
    return(slope)
}) %>%
    # Take mean
8    mean()

cat("Calibration slope:", s, "\n")

1: Start bootstrapping procedure, where we iterate 500 times (b) and call a function (\(x)).
2: Set a seed for the random process each bootstrap (the sampling)
3: Sample as many individuals as our data contains.
4: Sample is with replacement, meaning the same individual can be sampled multiple times.
5: Redevelop the model on the sampled data.
6: Make predictions on the development data using the sample-based prediction model.
7: Calculate the slope.
8: Take the mean of the slopes from all bootstraps.

Calibration slope: 0.9574837

4.1.2 Bootstrapping using the `rms` package

Alternatively, if we use the rms package, we do not have to code the bootstrapping yourself:

fit_rms <- rms::lrm(d.unfav ~                                        
                       age +     
                       d.sysbpt +
                       pupil.i +
                       cause,
                   data = development,
                   x = TRUE,
                   y = TRUE)
1val_rms <- rms::validate(fit_rms, B = b, bw = FALSE)

1: We set backward = FALSE because we did not perform backward selection. If backward selection were to be applied, it would need to be carried out within each bootstrap. In general, the full modeling strategy, including functional form selection, backward selection, etc., should be applied to each bootstrap sample. Some of these modeling approaches may not be supported by the rms::validate() function, so implementing them manually, as shown above, could allow for more flexibility and more effectively capture optimism.

We now obtain the uniform shrinkage factor using the bootstrapped calibration slope:

cat("Calibration slope:", val_rms["Slope", "test"], "\n")

Calibration slope: 0.9598674

We observe only a very small difference to our manually coded approach.

Now that we have s, we can multiply it with the coefficients of the standard model:

# Take coefficients from standard model and multiply by s
coefs <- fit_standard[["coefficients"]][-1] * s

Lastly, using these coefficients, we need to re-estimate the intercept. We do this by using an offset, meaning we estimate the outcome as a function of the linear predictor, where the linear predictor is held constant (i.e. given a regression coefficient of 1).

# Calculate linear predictors
lps <- x_dev %*% as.matrix(coefs)

# Re-estimate intercept
intercept <- glm(development[["d.unfav"]] ~ offset(lps), 
                 family = binomial)[["coefficients"]][[1]]

# Combine intercept and coefficients
coef_uniform <- c(intercept, coefs)

# Compare models
data.frame(standard = fit_standard[["coefficients"]],
           uniform_shrinkage = coef_uniform)

                               standard uniform_shrinkage
(Intercept)                15.999290431     15.2980891582
age                         0.036289668      0.0347467652
pol(d.sysbpt, 2)d.sysbpt   -0.275136026     -0.2634382541
pol(d.sysbpt, 2)d.sysbpt^2  0.001012795      0.0009697343
pupil.ino reactive pupils   1.822828253      1.7453283021
pupil.ione reactive         0.835094883      0.7995897207
causedomestic/fall          0.291831751      0.2794241382
causeMotorbike              0.002009155      0.0019237329
causeother                  0.452559793      0.4333186157
causeRoad traffic accident  0.242233817      0.2319349261

4.2 Validation

Prior to validating the model, we will wrangle the existing model fit a bit so that it contains the information relevant for validating the model.

# Update fit with new coefficients
fit_uniform <- fit_standard

# Add updated coefficients
fit_uniform[["coefficients"]] <- coef_uniform

# Recalculate linear predictors
# We perform matrix multiplication of each individual's values with the coefficients minus the intercept (as the matrix x does not contain this). Afterwards, we add the intercept again.
fit_uniform[["linear.predictors"]] <- 
1  fit_uniform[["coefficients"]][[1]] +
2  x_dev %*% as.matrix(fit_uniform[["coefficients"]])[-1]

# Recalculate the predicted risk
3fit_uniform[["fitted.values"]] <-
  1 / (1 + exp(-fit_uniform[["linear.predictors"]]))

1: The linear predictor is \(LP = \alpha + X' \hat{\beta}\), this is the first part \(\alpha\) of this equation.
2: The linear predictor is \(LP = \alpha + X' \hat{\beta}\), this is the second part \(X' \hat{\beta}\) of this equation.
3: To transform the linear predictor into predicted probabilities, we transform them using \(P(Y=1|X)=\frac{1}{1+exp(-LP)}\).

This may seem like a bit of a hassle, especially since the {rms} package already implements this. On the other hand, as discussed in the previous practical, using the {rms} package might have its downsides.

Now let’s see the validation metrics:

tibble(model = c("Standard model", "Uniform shrinkage"),
       cstat_dev = c(cstat(fit_standard, x = x_dev, y = y_dev, df = development),
                     cstat(fit_uniform, x = x_dev, y = y_dev, df = development)),
       cstat_val = c(cstat(fit_standard, x = x_val, y = y_val, df = validation), 
                     cstat(fit_uniform, x = x_val, y = y_val, df = validation)),
       int_dev = c(calmetrics(fit_standard, x = x_dev, y = y_dev, df = development)$cal_int,
                     calmetrics(fit_uniform, x = x_dev, y = y_dev, df = development)$cal_int),
       int_val = c(calmetrics(fit_standard, x = x_val, y = y_val, df = validation)$cal_int,
                     calmetrics(fit_uniform, x = x_val, y = y_val, df = validation)$cal_int),
       slope_dev = c(calmetrics(fit_standard, x = x_dev, y = y_dev, df = development)$cal_slope,
                     calmetrics(fit_uniform, x = x_dev, y = y_dev, df = development)$cal_slope),
       slope_val = c(calmetrics(fit_standard, x = x_val, y = y_val, df = validation)$cal_slope,
                     calmetrics(fit_uniform, x = x_val, y = y_val, df = validation)$cal_slope))

# A tibble: 2 × 7
  model             cstat_dev cstat_val int_dev int_val slope_dev slope_val
  <chr>                 <dbl>     <dbl>   <dbl>   <dbl>     <dbl>     <dbl>
1 Standard model        0.751     0.707       0   0.205      1        0.812
2 Uniform shrinkage     0.751     0.707       0   0.201      1.04     0.848

After applying uniform shrinkage, the C-statistics remained unchanged, since the relative ordering of patients was preserved. In the development data, the standard model performs perfectly, with a calibration intercept of 0 and a slope of 1, whereas uniform shrinkage modifies the calibration slope while leaving the intercept unchanged. In the validation data, the uniformly shrunken model shows slightly better calibration, with a calibration intercept closer to zero (i.e., less underestimation) and a calibration slope closer to 1 (i.e., less overfitting) compared with the standard model.

5 Decision-curve analysis

Finally, we will examine an alternative approach to evaluating model performance: decision curve analysis (DCA). DCA quantifies a model’s clinical usefulness by calculating a net benefit, which weights true positives against the harm of false positives based on a chosen threshold probability. This approach helps determine whether using the model improves decision-making compared to default strategies such as treating all or no patients. Read more about DCA here.

Based on guidance from our TBI experts, the relevant range of threshold probabilities is considered to be between 20% and 40%. Importantly, this range must be determined beforehand and should not be influenced by the DCA analysis. Read more about this here.

# Obtain predictions on validation data
validation$predictions <- preds(fit_standard, x = x_val, y = y_val, df = validation)

# Perform DCA analysis on validation data
1dca.plot <- dcurves::dca(y_val ~ predictions,
2                         data = validation,
3                         label = list(predictions = "Prediction Model"))

1: Specify the formula as the outcome on the predictions.
2: DCA is most informative when applied to validation data, as it captures the model’s true clinical benefit on unseen cases. Using development data may overestimate benefit unless optimism is accounted for.
3: Change the label of the model in the plot.

Assuming '1' is [Event] and '0' is [non-Event]

# Create plot of DCA
plot(dca.plot)

# Create smoothed DCA curve
plot(dca.plot |> plot(smooth = TRUE))

# Create table with net benefit values at certain thresholds
data.frame(
  dca.plot$dca |>
    filter(
      threshold == 0.2 |
        threshold == 0.3 |
        threshold == 0.4
    ) |>
    arrange(threshold) |>
    select(label, threshold, net_benefit) |>
    mutate(threshold = threshold * 100) |>
    mutate_if(is.numeric, round, 2)
)

             label threshold net_benefit
1        Treat All        20        0.26
2       Treat None        20        0.00
3 Prediction Model        20        0.26
4        Treat All        30        0.15
5       Treat None        30        0.00
6 Prediction Model        30        0.19
7        Treat All        40        0.01
8       Treat None        40        0.00
9 Prediction Model        40        0.12

From the Table we observe that at a 30% threshold probability, the prediction model yields a net benefit of 0.19. This means that using the model is equivalent to correctly identifying 19 additional patients per 100 who will experience an unfavourable 6-month outcome, after accounting for the downsides of unnecessary interventions in low-risk patients. Consequently, the model offers greater clinical utility at this threshold than both treating all patients (net benefit 0.15) and treating none (net benefit 0).

6 Final words

We now discussed different shrinkage methods and have seen how to implement these. There are other methods available for shrinkage and these methods can also be applied to regression models other than logistic regression, but you are now familiar with the implementation of some of the most popular approaches.

7 Postscript

This practical was developed on:

R version 4.5.0 (2025-04-11 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Europe/Amsterdam
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] patchwork_1.3.1 ggplot2_3.5.2   knitr_1.50      pROC_1.19.0.1  
 [5] rms_8.0-0       Hmisc_5.2-3     glmnet_4.1-9    Matrix_1.7-3   
 [9] rio_1.2.4       tidyr_1.3.1     magrittr_2.0.3  dplyr_1.1.4    

loaded via a namespace (and not attached):
 [1] gtable_0.3.6       shape_1.4.6.1      xfun_0.52          htmlwidgets_1.6.4 
 [5] lattice_0.22-6     vctrs_0.6.5        tools_4.5.0        generics_0.1.4    
 [9] sandwich_3.1-1     tibble_3.2.1       cluster_2.1.8.1    pacman_0.5.1      
[13] R.oo_1.27.1        pkgconfig_2.0.3    data.table_1.17.2  checkmate_2.3.2   
[17] RColorBrewer_1.1-3 lifecycle_1.0.4    compiler_4.5.0     farver_2.1.2      
[21] stringr_1.5.1      MatrixModels_0.5-4 codetools_0.2-20   SparseM_1.84-2    
[25] dcurves_0.5.0      quantreg_6.1       htmltools_0.5.8.1  yaml_2.3.10       
[29] htmlTable_2.4.3    Formula_1.2-5      pillar_1.10.2      R.utils_2.13.0    
[33] MASS_7.3-65        iterators_1.0.14   rpart_4.1.24       multcomp_1.4-28   
[37] foreach_1.5.2      nlme_3.1-168       tidyselect_1.2.1   digest_0.6.37     
[41] mvtnorm_1.3-3      polspline_1.1.25   stringi_1.8.7      purrr_1.1.0       
[45] labeling_0.4.3     splines_4.5.0      fastmap_1.2.0      grid_4.5.0        
[49] colorspace_2.1-1   cli_3.6.5          base64enc_0.1-3    utf8_1.2.5        
[53] survival_3.8-3     TH.data_1.1-3      withr_3.0.2        foreign_0.8-90    
[57] scales_1.4.0       backports_1.5.0    rmarkdown_2.29     nnet_7.3-20       
[61] gridExtra_2.3      R.methodsS3_1.8.2  zoo_1.8-14         evaluate_1.0.3    
[65] mgcv_1.9-1         rlang_1.1.6        Rcpp_1.0.14        glue_1.8.0        
[69] rstudioapi_0.17.1  jsonlite_2.0.0     R6_2.6.1

1 Introduction

2 Set-up

2.1 Packages

2.2 Loading data

2.3 Preparing our data

2.4 Getting to know our data

3 Ridge, LASSO, and elastic net

3.1 Standard model

3.2 Ridge regression

3.3 LASSO regression

3.4 Elastic net regression

3.5 Performance

3.5.1 Set-up for calculating validation measures

3.5.2 Obtain predictions for the validation en development data

3.5.3 Validate models

4 Bootstrap-based uniform shrinkage

4.1 Development

4.1.1 Manually coded bootstrapping

4.1.2 Bootstrapping using the rms package

4.2 Validation

5 Decision-curve analysis

6 Final words

7 Postscript

4.1.2 Bootstrapping using the `rms` package