# Loading packages
pacman::p_load("dplyr", # Data wrangling
"magrittr", # More efficient pipelines
"tidyr", # Data tidying
"rio", # Importing data
"glmnet", # Shrinkage methods
"rms", # Modelling functions
"pROC", # C-statistics
"knitr", # Markdown functions
"ggplot2", # Data visualization
"patchwork" # Patching plots together
)PREDICT - Prediction Practical 2
Shrinkage Methods and Performance Metrics for Prediction Models
1 Introduction
In this practical, we will see how to implement different shrinkage methods, specifically ridge regression, LASSO regression, and elastic net regression. Additionally, we will look at uniform shrinkage using bootstrapping. In this practical you will get more familiar with obtaining performance metrics (i.e., calibration, discrimination, and net benefit).
At the end of this practical, you will know how you can estimate the regularization parameter (\(\lambda\)) and use it to perform these shrinkage-based regressions, as well as perform uniform shrinkage.
2 Set-up
2.1 Packages
As usual, before we get going, we will load some useful packages.
2.2 Loading data
For this practical, we will again use the Traumatic brain injury (TBI) data. This dataset contains 2,159 patients from the international and US Tirilazad trials (distributed here for didactic purposes only). The primary outcome was the Glasgow Outcome (range 1 through 5) at 6 months.
The TBI dataset was sent to you together with this practical and can be loaded from your local device:
# Specify path of TBI dataset
path <- "C:/users/avid_PINT_student/documents/pint/TBI.txt"
# Load TBI data
tbi <- import(path)Warning in (function (input = "", file = NULL, text = NULL, cmd = NULL, :
Detected 24 column names but the data has 25 columns (i.e. invalid file). Added
an extra default column name for the first column which is guessed to be row
names or an index. Use setnames() afterwards if this guess is not correct, or
fix the file write command that created the file to create a valid file.
The warning we receive is the result of row numbers being present in the .txt file (as also guessed by import()). Given that import() guessed correct, we can ignore this warning and treat the extra column V1 as individual identifiers.
Below is the codebook for the TBI data.
| TBI | ||
| Variable | Description | |
| V1 | Identifier | |
| trial | Trial identification | |
| d.gos | Glasgow Outcome Scale at 6 months
|
|
| d.mort | Mortality at 6 months | |
| d.unfav | Unfavourable outcome at 6 months | |
| cause | Cause of injury
|
|
| age | Age (yrs) | |
| d.motor | Admission motor score (range 1-6) | |
| d.pupil | Pupillary reactivity
|
|
| pupil.i | Single imputed pupillary reactivity | |
| hypoxia | Hypoxia before or at admission | |
| hypotens | Hypotension before or at admission | |
| ctclass | Marshall CT classification (range 1-6) | |
| tsah | tSAH at CT | |
| edh | EDH at CT | |
| cisterns | Compressed cisterns at CT
|
|
| shift | Midline shift >5 mm at CT | |
| d.sysbpt | Systolic blood pressure at admission (mmHg) | |
| glucose | Glucose at admission (mmol/L) | |
| glucoset | Truncated glucose values (mmol/L) | |
| ph | pH | |
| sodium | Sodium (mmol/L) | |
| hb | Hemoglobin (g/dL) | |
| hbt | Truncated hemoglobin (g/dL) | |
2.3 Preparing our data
Like last time, we will develop a prediction model for the risk of an unfavourable outcome after a TBI. However, we will geographically split the data into a development dataset (individuals from the Tirilazad US trial) and a validation dataset (individuals from the Tirilazad International trial). Additionally, although not recommended, for educational simplicity, we have again used single imputation for missing data.
2.4 Getting to know our data
To get to know our data, we will look at the number of cases in both datasets.
tbi[["trial"]]: Tirilazad International
0 1
662 456
------------------------------------------------------------
tbi[["trial"]]: Tirilazad US
0 1
646 395
# Distribution of non-cases and cases (as %)
by(tbi[["d.unfav"]], tbi[["trial"]], \(x) proportions(table(x)) * 100)tbi[["trial"]]: Tirilazad International
x
0 1
59.21288 40.78712
------------------------------------------------------------
tbi[["trial"]]: Tirilazad US
x
0 1
62.05572 37.94428
In the development dataset, we can see that we have quite some cases (n = 395, 37.9%). We have a bit more in the validation dataset (n = 456, 40.8%).
We consulted TBI experts, who told us that the most important predictors we want to include are pupillary reactivity (pupil.i), TBI cause (cause), systolic blood pressure (d.sysbpt), and age (age). We now use the pupil.i variable that is obtained using single impuation. We check if the continuous variables age and d.sysdpt need to be modelled non-linearly This would mean that we have used are a total of 12 parameters, even though we conclude to use linear age and model dsysbpt using a piecewise linear transformation.
# Model using imputed pupil information
dd <- rms::datadist(development)
options(datadist="dd")
non_linear_model <- rms::lrm(
d.unfav ~
rms::rcs(age, 4) +
rms::rcs(d.sysbpt, 4) +
pupil.i +
cause,
data = development)
# Indicates non-linearity for systolic blood pressure
anova(non_linear_model) Wald Statistics Response: d.unfav
Factor Chi-Square d.f. P
age 38.62 3 <.0001
Nonlinear 3.00 2 0.2227
d.sysbpt 24.19 3 <.0001
Nonlinear 23.01 2 <.0001
pupil.i 107.79 2 <.0001
cause 3.93 4 0.4156
TOTAL NONLINEAR 25.02 4 <.0001
TOTAL 160.69 12 <.0001
If we were to calculate an events-per-variable (EPV), this would be:
Effects Response : d.unfav
Factor Low High Diff. Effect
age 23.00 41.00 18.00 0.884730
Odds Ratio 23.00 41.00 18.00 2.422300
d.sysbpt 123.17 143.13 19.96 0.184420
Odds Ratio 123.17 143.13 19.96 1.202500
pupil.i - no reactive pupils:both reactive 1.00 2.00 NA 1.831500
Odds Ratio 1.00 2.00 NA 6.243400
pupil.i - one reactive:both reactive 1.00 3.00 NA 0.866100
Odds Ratio 1.00 3.00 NA 2.377600
cause - Assault:Road traffic accident 5.00 1.00 NA -0.251800
Odds Ratio 5.00 1.00 NA 0.777400
cause - domestic/fall:Road traffic accident 5.00 2.00 NA 0.072995
Odds Ratio 5.00 2.00 NA 1.075700
cause - Motorbike:Road traffic accident 5.00 3.00 NA -0.215150
Odds Ratio 5.00 3.00 NA 0.806420
cause - other:Road traffic accident 5.00 4.00 NA 0.206520
Odds Ratio 5.00 4.00 NA 1.229400
S.E. Lower 0.95 Upper 0.95
0.20000 0.49274 1.27670
NA 1.63680 3.58490
0.19087 -0.18967 0.55851
NA 0.82723 1.74810
0.17951 1.47970 2.18340
NA 4.39150 8.87610
0.21213 0.45033 1.28190
NA 1.56880 3.60340
0.26283 -0.76694 0.26335
NA 0.46443 1.30130
0.22052 -0.35922 0.50521
NA 0.69822 1.65730
0.23124 -0.66837 0.23806
NA 0.51255 1.26880
0.20083 -0.18709 0.60014
NA 0.82937 1.82240
nr_coef <- length(coef(non_linear_model)[-1])
cat("Number of coefficients (excluding intercept):", nr_coef, "\n")Number of coefficients (excluding intercept): 12
tbi[["trial"]]: Tirilazad International
[1] 38
------------------------------------------------------------
tbi[["trial"]]: Tirilazad US
[1] 32.91667
This is not bad, but EPV is also not an ideal measure. More formal sample size calculations are available but are outside the scope of this course.
To investigate the best functional form for systolic blood pressure, let us try a couple models and compare the performance using the BIC (read more here) and look at the effect plot.
# restricted cubic splines
model_1 <- rms::lrm(
d.unfav ~
age +
rms::rcs(d.sysbpt, 4) +
pupil.i +
cause,
data = development)
# quadratic
model_2 <- rms::lrm(
d.unfav ~
age +
pol(d.sysbpt, 2) +
pupil.i +
cause,
data = development)
# linear piecewise regression
model_3 <- rms::lrm(
d.unfav ~
age +
lsp(d.sysbpt, 125) +
pupil.i +
cause,
data = development)
# compare performance
cat("Model performance of model 1:", BIC(model_1), "\n",
"Model performance of model 2:", BIC(model_2), "\n",
"Model performance of model 3:", BIC(model_3), "\n")Model performance of model 1: 1250.81
Model performance of model 2: 1244.537
Model performance of model 3: 1244.47
As the results show, the model performs similarly across specifications. Because we prefer not to impose a specific value for the location of the kink, we will instead use a quadratic term to allow the relationship to vary smoothly.
3 Ridge, LASSO, and elastic net
Please have a look at this post for a visual presentation of the three regularization methods.
3.1 Standard model
To show how shrinkage affects the model, we will first develop a standard logistic prediction model.
# Develop the prediction model
fit_standard <- glm(d.unfav ~
age +
pol(d.sysbpt, 2) +
pupil.i +
cause,
data = development,
family = binomial)
# See output
summary(fit_standard)
Call:
glm(formula = d.unfav ~ age + pol(d.sysbpt, 2) + pupil.i + cause,
family = binomial, data = development)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 15.9992904 3.9739447 4.026 5.67e-05 ***
age 0.0362897 0.0060717 5.977 2.27e-09 ***
pol(d.sysbpt, 2)d.sysbpt -0.2751360 0.0596979 -4.609 4.05e-06 ***
pol(d.sysbpt, 2)d.sysbpt^2 0.0010128 0.0002239 4.524 6.06e-06 ***
pupil.ino reactive pupils 1.8228283 0.1791413 10.175 < 2e-16 ***
pupil.ione reactive 0.8350949 0.2109528 3.959 7.54e-05 ***
causedomestic/fall 0.2918318 0.2967658 0.983 0.325
causeMotorbike 0.0020092 0.3120119 0.006 0.995
causeother 0.4525598 0.2865500 1.579 0.114
causeRoad traffic accident 0.2422338 0.2597648 0.933 0.351
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1382.0 on 1040 degrees of freedom
Residual deviance: 1175.1 on 1031 degrees of freedom
AIC: 1195.1
Number of Fisher Scoring iterations: 4
3.2 Ridge regression
Ridge regression, also known as \(l_2\) regularization, applies a penalty to the maximum likelihood estimator, leading to shrinkage of the regression coefficients, which results in a less optimistic model. In the below formula, the first part shows the maximum likelihood estimator of a logistic model and the second (blue) part the addition of the \(l_2\) regularization:
\[ \log{L} = \sum_iy_i\log{\pi_i}+(1-y_i)\log{(1-\pi_i)}-\color{blue}{\lambda\sum_{j=1}^P\hat{\beta}_j^2} \]
where \(i\) stands for individual \(i\), \(y_i\) is the observed outcome for individual \(i\) and \(\pi_i\) is the predicted outcome for individual i.
We can fit a ridge regression using glmnet() from the {glmnet} package. However, first we require a value for \(\lambda\). The preferred way to determine this is by \(k\)-fold cross-validation (read more here), as implemented in the function cv.glmnet()from {glmnet} package:
1# Prepare predictors
2x_dev <- model.matrix(
~ age + pol(d.sysbpt, 2) + pupil.i + cause,
data = dplyr::select(development, age, d.sysbpt, pupil.i, cause)
)
3x_dev <- x_dev[, -1]
4# Prepare outcome
y_dev <- as.matrix(development[, "d.unfav"])
# Define range of lambdas
5lambdas <- seq(0.001, 10, length.out = 3e4)
# Run cross-validation (this might take a minute)
cv_ridge <- glmnet::cv.glmnet(x = x_dev,
y = y_dev,
family = "binomial",
6 type.measure = "mse",
lambda = lambdas,
7 alpha = 0,
8 nfolds = 10)
# Get optimal lambda
cat(" lambda.min :", cv_ridge[["lambda.min"]], "\n",
"log(lambda.min) :", log(cv_ridge[["lambda.min"]]), "\n",
"lambda.1se :", cv_ridge[["lambda.1se"]], "\n",
"log(lambda.1se) :", log(cv_ridge[["lambda.1se"]]), "\n")
9lambda_ridge <- cv_ridge[["lambda.1se"]]
10# Show results of cross-validation
plot(cv_ridge)- 1
-
The function
cv.glmnet()requires a matrix with the predictors, where each row corresponds to an observation and each column a predictor. - 2
-
Factors are not allowed in the x matrix, which means that we should already create dummy variables for the factors. The function
model.matrix()does this for us. - 3
-
Because
glmnetautomatically includes an intercept, we remove the first column to not generate an additional intercept. - 4
- The outcome should also be given as a matrix or vector.
- 5
-
We specify our own range of lambda’s we would like to test using cross-validation. Note that
glmnet()can also do this on our own, but this may be unreliable. - 6
-
We selected the model based on mean squared error, but other performance metrics can be used as well. See the documentation with
?cv.glmnetfor details. - 7
-
We specify
alpha = 0to indicate that we want to use ridge regression. If we usealpha = 1, we indicate LASSO regression. Anything in between would result in elastic net regression. - 8
- We specify 10-fold cross-validation.
- 9
-
The optimal \(\lambda\) is derived by taking
lambda.1sefrom the results, alternatively we could extractlambda.min. Uselambda.minwhen you want the model with best predictive accuracy. Uselambda.1sewhen you prefer a simpler, more regularized model that is often more robust. Read more here. - 10
- We can show that the \(\lambda\) we selected (the dashed line) is the one with the lowest mean squared error. The x-axis is in log scale for clarity.
lambda.min : 0.001
log(lambda.min) : -6.907755
lambda.1se : 0.01533238
log(lambda.1se) : -4.177788
We choose our optimal \(\lambda\) value to be:
Now we can fit our ridge regression model:
# Fit ridge regression
fit_ridge <- glmnet::glmnet(x = x_dev,
y = y_dev,
family = "binomial",
alpha = 0,
lambda = lambda_ridge)
# Compare coefficients
data.frame(standard = fit_standard[["coefficients"]],
ridge = as.numeric(coef(fit_ridge))) standard ridge
(Intercept) 15.999290431 -6.485339e-01
age 0.036289668 3.364532e-02
pol(d.sysbpt, 2)d.sysbpt -0.275136026 -1.699942e-02
pol(d.sysbpt, 2)d.sysbpt^2 0.001012795 3.721493e-05
pupil.ino reactive pupils 1.822828253 1.683370e+00
pupil.ione reactive 0.835094883 7.489008e-01
causedomestic/fall 0.291831751 2.194116e-01
causeMotorbike 0.002009155 -1.027015e-01
causeother 0.452559793 3.732177e-01
causeRoad traffic accident 0.242233817 1.666817e-01
Comparing these coefficients to the standard model, we may observe that the coefficients are shrunk. If you have a polychotomous predictor (i.e., categorical predictor with more than two levels), not all coefficients for that predictor might shrink.
If we set \(\lambda\) to 0, we can see that it will be practically the same as the standard logistic regression:
# Fit ridge regression with lambda 0
fit_glmnet <- glmnet::glmnet(x = x_dev,
y = y_dev,
family = "binomial",
alpha = 0,
lambda = 0)
# Compare coefficients
data.frame(glm = fit_standard[["coefficients"]],
glmnet = as.numeric(coef(fit_glmnet))) glm glmnet
(Intercept) 15.999290431 15.6024704013
age 0.036289668 0.0362930316
pol(d.sysbpt, 2)d.sysbpt -0.275136026 -0.2691319203
pol(d.sysbpt, 2)d.sysbpt^2 0.001012795 0.0009903516
pupil.ino reactive pupils 1.822828253 1.8219674843
pupil.ione reactive 0.835094883 0.8342604853
causedomestic/fall 0.291831751 0.2920420943
causeMotorbike 0.002009155 0.0016114572
causeother 0.452559793 0.4532842621
causeRoad traffic accident 0.242233817 0.2428490747
Note that the regression coefficients are not exactly the same. This might be because glm() from {stats} and glmnet() from {glmnet} have implemented the estimation of the regression coefficients differently.
3.3 LASSO regression
The least absolute shrinkage and selection operator (LASSO) regression functions similar to ridge regression, except that it is able to shrink coefficients to 0, leading to removal of those predictors from the prediction model. It applies the \(l_1\) penalty and is expressed in the red part as:
\[ \log{L} = \sum_iy_i\log{\pi_i}+(1-y_i)\log{(1-\pi_i)}-\color{red}{\lambda\sum_{j=1}^{P}\lvert\hat{\beta}_j\lvert} \]
# Run cross-validation
cv_lasso <- glmnet::cv.glmnet(x = x_dev,
y = y_dev,
family = "binomial",
type.measure = "mse",
lambda = lambdas,
1 alpha = 1,
nfolds = 10)
# Show results of cross-validation
plot(cv_lasso)- 1
-
Now we specify
alpha = 1, indicating LASSO regression.
Our optimal \(\lambda\) value is now:
Using our newly estimated \(\lambda\), we can fit a LASSO regression:
# Fit LASSO regression
fit_lasso <- glmnet::glmnet(x = x_dev,
y = y_dev,
family = "binomial",
alpha = 1,
lambda = lambda_lasso)
# Compare coefficients
data.frame(standard = fit_standard[["coefficients"]],
ridge = as.numeric(coef(fit_ridge)),
lasso = as.numeric(coef(fit_lasso))) standard ridge lasso
(Intercept) 15.999290431 -6.485339e-01 -1.50972408
age 0.036289668 3.364532e-02 0.02164983
pol(d.sysbpt, 2)d.sysbpt -0.275136026 -1.699942e-02 0.00000000
pol(d.sysbpt, 2)d.sysbpt^2 0.001012795 3.721493e-05 0.00000000
pupil.ino reactive pupils 1.822828253 1.683370e+00 1.26140256
pupil.ione reactive 0.835094883 7.489008e-01 0.19501530
causedomestic/fall 0.291831751 2.194116e-01 0.00000000
causeMotorbike 0.002009155 -1.027015e-01 0.00000000
causeother 0.452559793 3.732177e-01 0.00000000
causeRoad traffic accident 0.242233817 1.666817e-01 0.00000000
From this output, we can see that two levels of cause were set to 0, effectively removing those variables from the final prediction model.
3.4 Elastic net regression
A special case is elastic net regression, where we combine the \(l_1\) and \(l_2\) penalties, or in other words blend ridge and LASSO regression. This takes the advantages of both approaches. A logistic elastic net model is expressed as
\[ \log{L} = \sum_iy_i\log{\pi_i}+(1-y_i)\log{(1-\pi_i)}- \color{green}{\lambda}\left( \color{red}{\alpha\sum_{j=1}^{P}\lvert\hat{\beta}_j\rvert} \color{black}{+} \color{blue}{\frac{1-\alpha}{2}\sum_{p=1}^P\hat{\beta}_p^2}\right) \]
We can see that \(\lambda\) applies to both penalties at the same time and that \(\alpha\) determines to what extent each penalty is applied. For instance, we can perform cross validation and an elastic net regression with an alpha of 0.5 as follows:
# Run cross-validation
cv_elnet <- glmnet::cv.glmnet(x = x_dev,
y = y_dev,
family = "binomial",
type.measure = "mse",
lambda = lambdas,
alpha = 0.5,
nfolds = 10)
# Show results of cross-validation
plot(cv_elnet)Our \(\lambda\) is:
# Fit elastic net regression
fit_elnet <- glmnet::glmnet(x = x_dev,
y = y_dev,
family = "binomial",
alpha = 0.5,
lambda = lambda_elnet)
# Compare coefficients
data.frame(standard = fit_standard[["coefficients"]],
ridge = as.numeric(coef(fit_ridge)),
lasso = as.numeric(coef(fit_lasso)),
`elastic net` = as.numeric(coef(fit_elnet))) standard ridge lasso elastic.net
(Intercept) 15.999290431 -6.485339e-01 -1.50972408 -1.366747650
age 0.036289668 3.364532e-02 0.02164983 0.023879622
pol(d.sysbpt, 2)d.sysbpt -0.275136026 -1.699942e-02 0.00000000 -0.001830626
pol(d.sysbpt, 2)d.sysbpt^2 0.001012795 3.721493e-05 0.00000000 0.000000000
pupil.ino reactive pupils 1.822828253 1.683370e+00 1.26140256 1.291207172
pupil.ione reactive 0.835094883 7.489008e-01 0.19501530 0.344606628
causedomestic/fall 0.291831751 2.194116e-01 0.00000000 0.000000000
causeMotorbike 0.002009155 -1.027015e-01 0.00000000 0.000000000
causeother 0.452559793 3.732177e-01 0.00000000 0.000000000
causeRoad traffic accident 0.242233817 1.666817e-01 0.00000000 0.000000000
Note that when we estimate a \(\lambda\) for a model, we cannot automatically use that same \(\lambda\) for the same prediction model with a different shrinkage method. For instance, if I estimate a \(\lambda\) for a ridge regression model with a set of predictors \(X\) and outcome \(Y\), I will need to reestimate \(\lambda\) if I want to fit predictors \(X\) for outcome \(Y\) with a LASSO regression model.
3.5 Performance
3.5.1 Set-up for calculating validation measures
Now that we fit our standard and shrunk models, we can see how they compare in performance, especially in an external validation set.
We will first create a function to calculate the C-statistic for our models:
# Function to calculate C-statistic for output of a glmnet
1cstat <- function(fit,
2 x = NULL,
3 y = NULL,
4 df = NULL) {
if ("glmnet" %in% class(fit)) {
# Calculate predictions for glmnet output
preds <- predict(fit, newx = x, type = "response")
} else {
# Calculate predictions for glm output
preds <- predict(fit, newdata = df, type = "response")
}
# Calculate C-statistic
AUC <- pROC::roc(
response = as.numeric(y),
predictor = as.numeric(preds),
levels = c(0, 1),
direction = "<"
)[["auc"]][[1]]
# Return c-statistic
return(AUC)
}- 1
-
fitis the resulting fit from glm() or glmnet() - 2
-
xis covariate matrix of the data to validate (in case of glmnet) - 3
-
yis outcome matrix of the data to validate (in case of glmnet) - 4
-
dfis data frame including both the outcome and predictors (in case of glm)
We do not need to create these functions. However, to keep our script legible and more efficient, we pooled together the calculation of C-statistics for glm and glmnet objects. Simply, it calculates the predictions based on whether the model was fitted using glm() or glmnet() and then uses the roc() function from {pROC} to calculate the C-statistic and return it.
We will also need to create the x and y matrices of the validation data:
Moreover, we will create a function to calculate the calibration intercept and slope for all models:
# Function to calculate slope
calmetrics <- function(fit,
x = NULL,
y = NULL,
df = NULL) {
if ("glmnet" %in% class(fit)) {
# Calculate linear predictor for glmnet output
lp <- predict(fit, newx = x, type = "link")
} else{
# Calculate linear predictor for glm output
lp <- predict(fit, newdata = df, type="link")
}
# Calculate calibration
fit_int <- glm(y ~ offset(lp), family = binomial)
fit_slope <- glm(y ~ lp, family = binomial)
# Get intercept and slope
cal_int <- round(fit_int[["coefficients"]][[1]], 3)
cal_slope <- round(fit_slope[["coefficients"]][[2]], 3)
# Return results
return(list(cal_int = cal_int, cal_slope = cal_slope))
}3.5.2 Obtain predictions for the validation en development data
# Function to calculate predictions
preds <- function(fit, x, y, df){
if ("glmnet" %in% class(fit)){
# Calculate predictions for glmnet output
preds <- predict(fit, newx = x, type = "response")
} else {
# Calculate predictions for glm output
preds <- predict(fit, newdata = df, type = "response")
}
# Return predictions
return(preds)
}We will put all predictions in a single data frame which we will need to draw our calibration plots for the development and validation data using all the different models:
## Create data tibble for each model
pred_dt <- c()
models <- c("standard", "ridge", "lasso", "elnet")
1for (model in models){
# extract fit
2 fit <- eval(parse(text=paste0("fit_", model)))
3 # attach to data frame
pred_dev <- preds(fit, x = x_dev, y = y_dev, df = development)
pred_val <- preds(fit, x = x_val, y = y_val, df = validation)
# create table
dat <- tibble(
4 cohort = c(rep("Development", nrow(development)),
rep("Validation", nrow(validation))),
preds = c(pred_dev, pred_val),
5 obs = c(y_dev, y_val)
) |>
6 mutate(model = model)
# attach to prediction data frame
7 pred_dt <- rbind(pred_dt, dat)
}- 1
- Loop over all glm() and glmnet() models.
- 2
- At each step of the loop, extract the fit.
- 3
- Obtain predictions for each model.
- 4
- Save predictions for development and validation into one data frame.
- 5
- Save outcomes for development and validation into one data frame.
- 6
- Add a column to indicate which model was used to obtain the predictions.
- 7
- For each model, stack the results to obtain one data frame with all predictions using all models.
# edit final data frame
model_names <- c("Standard", "Ridge", "LASSO", "Elastic net")
pred_dt <- pred_dt |>
mutate(
1 model = dplyr::recode(
model,
standard = "Standard",
ridge = "Ridge",
lasso = "LASSO",
elnet = "Elastic net"
),
2 model = factor(model, levels = model_names)
)
print(pred_dt)- 1
-
Recode the strings used in the column
model. - 2
-
Define
modelas a factor.
# A tibble: 8,636 × 4
cohort preds obs model
<chr> <dbl> <int> <fct>
1 Development 0.203 1 Standard
2 Development 0.162 0 Standard
3 Development 0.135 0 Standard
4 Development 0.151 0 Standard
5 Development 0.866 1 Standard
6 Development 0.538 1 Standard
7 Development 0.167 0 Standard
8 Development 0.539 1 Standard
9 Development 0.122 0 Standard
10 Development 0.148 0 Standard
# ℹ 8,626 more rows
3.5.3 Validate models
Now we can determine how the different models performed.
# Calculate C-statistics
results <- tibble(model = rep(model_names, 2),
cohort = c(rep("Development", 4), rep("Validation", 4)),
cstat = c(cstat(fit_standard, x = x_dev, y = y_dev, df = development),
cstat(fit_ridge, x = x_dev, y = y_dev, df = development),
cstat(fit_lasso, x = x_dev, y = y_dev, df = development),
cstat(fit_elnet, x = x_dev, y = y_dev, df = development),
cstat(fit_standard, x = x_val, y = y_val, df = validation),
cstat(fit_ridge, x = x_val, y = y_val, df = validation),
cstat(fit_lasso, x = x_val, y = y_val, df = validation),
cstat(fit_elnet, x = x_val, y = y_val, df = validation)))
print(results)# A tibble: 8 × 3
model cohort cstat
<chr> <chr> <dbl>
1 Standard Development 0.751
2 Ridge Development 0.739
3 LASSO Development 0.728
4 Elastic net Development 0.732
5 Standard Validation 0.707
6 Ridge Validation 0.702
7 LASSO Validation 0.688
8 Elastic net Validation 0.695
Let’s also look at the calibration slopes
# Calculate calibration slopes
results$cal_int <- c(calmetrics(fit_standard, x = x_dev, y = y_dev, df = development)$cal_int,
calmetrics(fit_ridge, x = x_dev, y = y_dev, df = development)$cal_int,
calmetrics(fit_lasso, x = x_dev, y = y_dev, df = development)$cal_int,
calmetrics(fit_elnet, x = x_dev, y = y_dev, df = development)$cal_int,
calmetrics(fit_standard, x = x_val, y = y_val, df = validation)$cal_int,
calmetrics(fit_ridge, x = x_val, y = y_val, df = validation)$cal_int,
calmetrics(fit_lasso, x = x_val, y = y_val, df = validation)$cal_int,
calmetrics(fit_elnet, x = x_val, y = y_val, df = validation)$cal_int)
results$cal_slope <- c(calmetrics(fit_standard, x = x_dev, y = y_dev, df = development)$cal_slope,
calmetrics(fit_ridge, x = x_dev, y = y_dev, df = development)$cal_slope,
calmetrics(fit_lasso, x = x_dev, y = y_dev, df = development)$cal_slope,
calmetrics(fit_elnet, x = x_dev, y = y_dev, df = development)$cal_slope,
calmetrics(fit_standard, x = x_val, y = y_val, df = validation)$cal_slope,
calmetrics(fit_ridge, x = x_val, y = y_val, df = validation)$cal_slope,
calmetrics(fit_lasso, x = x_val, y = y_val, df = validation)$cal_slope,
calmetrics(fit_elnet, x = x_val, y = y_val, df = validation)$cal_slope)
print(results)# A tibble: 8 × 5
model cohort cstat cal_int cal_slope
<chr> <chr> <dbl> <dbl> <dbl>
1 Standard Development 0.751 0 1
2 Ridge Development 0.739 0 1.10
3 LASSO Development 0.728 0 1.50
4 Elastic net Development 0.732 0 1.48
5 Standard Validation 0.707 0.205 0.812
6 Ridge Validation 0.702 0.225 0.964
7 LASSO Validation 0.688 0.216 1.40
8 Elastic net Validation 0.695 0.206 1.38
This shows that the discriminative ability is slightly lower in the validation data compared to the development data, which may be due to poor model fit or a less heterogeneous case-mix in the validation data compared to the development data. The calibration intercept is perfectly zero in the development set for all models, whereas all models on average underestimate in the validation data. Only the calibration slope of the standard model is one in the development set, but greater than one for the penalized models because we artifically force the coefficients cause underfitting in the development data. As a result, in the validation set the standard model is overfitting, because the coefficients of the standard model were too extreme, whereas the penalized models are underfitting.
We can also look at the calibration plots:
# Create calibration plots for each model
ggplot2::ggplot(pred_dt, mapping = ggplot2::aes(x = preds, y = obs, colour = cohort)) +
# Geometries
ggplot2::geom_abline(linewidth = 1,
colour = "black",
alpha = 0.33) +
ggplot2::geom_point(alpha = 0.33) +
ggplot2::geom_smooth(
colour = "#785EF0",
fill = "#785EF0",
method = "loess",
formula = "y ~ x"
) +
# Scaling
ggplot2::scale_colour_manual(values = c("#648FFF", "#FFB000"),
aesthetics = c("fill", "colour")) +
# Labels
ggplot2::xlab("Predicted probability") +
ggplot2::ylab("Observed probability") +
# Transformations
ggplot2::coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
ggplot2::facet_grid(rows = vars(model), cols = vars(cohort)) +
# Aesthetics
ggplot2::theme_bw() +
ggplot2::theme(legend.title = element_blank(), legend.position = "bottom")What we see is that (although small for the C-statistics), the standard model performs better in the development data, but worse in the validation data. Because the improvement is only minimal, the calibration plots do not show much difference. Note that this does not have to be the case: van Calster et al.. However, it is good practice to show the calibration plots to see how calibration varies across the range of predicted risk, and not only summarized into measures such as the calibration intercept and slope.
4 Bootstrap-based uniform shrinkage
The shrinkage methods we just saw work by adding a penalty to the maximum likelihood estimator. A different approach is bootstrap-based uniform shrinkage, which calculates a shrinkage factor and uses this to penalize the estimated regression coefficients (instead of penalizing the estimator that estimates the coefficients). This is also the method used in the {rms} package as seen in the first practical.
4.1 Development
In short, we develop the prediction model on our full development data. Then, we bootstrap the data B times and redevelop the model in each bootstrap sample b. Each redeveloped model is applied to the development data and a calibration slope is calculated. Then, the shrinkage factor s equals the average calibration slope among all bootstrap samples. The final regression coefficients are then calculated as \(\hat{\beta}_{j,shrunk} = \hat{\beta}_j \cdot s\). Finally, we recalibrate the model intercept.
4.1.1 Manually coded bootstrapping
We already developed our standard model. For the bootstrapping, we can do:
# set seed for reproducibility
set.seed(1)
# Specify number of bootstraps
b <- 500
# Start bootstrapping procedure
1s <- sapply(1:b, \(x){
# Set seed based on bootstrap iteration
2 set.seed(x)
# Randomly sample data with replacement
dat_smp <- slice_sample(development,
3 n = nrow(development),
4 replace = TRUE)
# Redevelop model
fit_smp <- glm(d.unfav ~
age +
d.sysbpt +
pupil.i +
cause,
5 data = dat_smp, family = binomial)
# Get linear predictors
6 development[["lps"]] <- predict(fit_smp, newdata = development)
# Calculate calibration slope
slope <- glm(d.unfav ~ lps,
data = development,
7 family = binomial)[["coefficients"]][["lps"]]
# Return slope
return(slope)
}) %>%
# Take mean
8 mean()
cat("Calibration slope:", s, "\n")- 1
-
Start bootstrapping procedure, where we iterate 500 times (b) and call a function (
\(x)). - 2
- Set a seed for the random process each bootstrap (the sampling)
- 3
- Sample as many individuals as our data contains.
- 4
- Sample is with replacement, meaning the same individual can be sampled multiple times.
- 5
- Redevelop the model on the sampled data.
- 6
- Make predictions on the development data using the sample-based prediction model.
- 7
- Calculate the slope.
- 8
- Take the mean of the slopes from all bootstraps.
Calibration slope: 0.9574837
4.1.2 Bootstrapping using the rms package
Alternatively, if we use the rms package, we do not have to code the bootstrapping yourself:
fit_rms <- rms::lrm(d.unfav ~
age +
d.sysbpt +
pupil.i +
cause,
data = development,
x = TRUE,
y = TRUE)
1val_rms <- rms::validate(fit_rms, B = b, bw = FALSE)- 1
-
We set backward = FALSE because we did not perform backward selection. If backward selection were to be applied, it would need to be carried out within each bootstrap. In general, the full modeling strategy, including functional form selection, backward selection, etc., should be applied to each bootstrap sample. Some of these modeling approaches may not be supported by the
rms::validate()function, so implementing them manually, as shown above, could allow for more flexibility and more effectively capture optimism.
We now obtain the uniform shrinkage factor using the bootstrapped calibration slope:
We observe only a very small difference to our manually coded approach.
Now that we have s, we can multiply it with the coefficients of the standard model:
Lastly, using these coefficients, we need to re-estimate the intercept. We do this by using an offset, meaning we estimate the outcome as a function of the linear predictor, where the linear predictor is held constant (i.e. given a regression coefficient of 1).
# Calculate linear predictors
lps <- x_dev %*% as.matrix(coefs)
# Re-estimate intercept
intercept <- glm(development[["d.unfav"]] ~ offset(lps),
family = binomial)[["coefficients"]][[1]]
# Combine intercept and coefficients
coef_uniform <- c(intercept, coefs)
# Compare models
data.frame(standard = fit_standard[["coefficients"]],
uniform_shrinkage = coef_uniform) standard uniform_shrinkage
(Intercept) 15.999290431 15.2980891582
age 0.036289668 0.0347467652
pol(d.sysbpt, 2)d.sysbpt -0.275136026 -0.2634382541
pol(d.sysbpt, 2)d.sysbpt^2 0.001012795 0.0009697343
pupil.ino reactive pupils 1.822828253 1.7453283021
pupil.ione reactive 0.835094883 0.7995897207
causedomestic/fall 0.291831751 0.2794241382
causeMotorbike 0.002009155 0.0019237329
causeother 0.452559793 0.4333186157
causeRoad traffic accident 0.242233817 0.2319349261
4.2 Validation
Prior to validating the model, we will wrangle the existing model fit a bit so that it contains the information relevant for validating the model.
# Update fit with new coefficients
fit_uniform <- fit_standard
# Add updated coefficients
fit_uniform[["coefficients"]] <- coef_uniform
# Recalculate linear predictors
# We perform matrix multiplication of each individual's values with the coefficients minus the intercept (as the matrix x does not contain this). Afterwards, we add the intercept again.
fit_uniform[["linear.predictors"]] <-
1 fit_uniform[["coefficients"]][[1]] +
2 x_dev %*% as.matrix(fit_uniform[["coefficients"]])[-1]
# Recalculate the predicted risk
3fit_uniform[["fitted.values"]] <-
1 / (1 + exp(-fit_uniform[["linear.predictors"]]))- 1
- The linear predictor is \(LP = \alpha + X' \hat{\beta}\), this is the first part \(\alpha\) of this equation.
- 2
- The linear predictor is \(LP = \alpha + X' \hat{\beta}\), this is the second part \(X' \hat{\beta}\) of this equation.
- 3
- To transform the linear predictor into predicted probabilities, we transform them using \(P(Y=1|X)=\frac{1}{1+exp(-LP)}\).
This may seem like a bit of a hassle, especially since the {rms} package already implements this. On the other hand, as discussed in the previous practical, using the {rms} package might have its downsides.
Now let’s see the validation metrics:
tibble(model = c("Standard model", "Uniform shrinkage"),
cstat_dev = c(cstat(fit_standard, x = x_dev, y = y_dev, df = development),
cstat(fit_uniform, x = x_dev, y = y_dev, df = development)),
cstat_val = c(cstat(fit_standard, x = x_val, y = y_val, df = validation),
cstat(fit_uniform, x = x_val, y = y_val, df = validation)),
int_dev = c(calmetrics(fit_standard, x = x_dev, y = y_dev, df = development)$cal_int,
calmetrics(fit_uniform, x = x_dev, y = y_dev, df = development)$cal_int),
int_val = c(calmetrics(fit_standard, x = x_val, y = y_val, df = validation)$cal_int,
calmetrics(fit_uniform, x = x_val, y = y_val, df = validation)$cal_int),
slope_dev = c(calmetrics(fit_standard, x = x_dev, y = y_dev, df = development)$cal_slope,
calmetrics(fit_uniform, x = x_dev, y = y_dev, df = development)$cal_slope),
slope_val = c(calmetrics(fit_standard, x = x_val, y = y_val, df = validation)$cal_slope,
calmetrics(fit_uniform, x = x_val, y = y_val, df = validation)$cal_slope))# A tibble: 2 × 7
model cstat_dev cstat_val int_dev int_val slope_dev slope_val
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Standard model 0.751 0.707 0 0.205 1 0.812
2 Uniform shrinkage 0.751 0.707 0 0.201 1.04 0.848
After applying uniform shrinkage, the C-statistics remained unchanged, since the relative ordering of patients was preserved. In the development data, the standard model performs perfectly, with a calibration intercept of 0 and a slope of 1, whereas uniform shrinkage modifies the calibration slope while leaving the intercept unchanged. In the validation data, the uniformly shrunken model shows slightly better calibration, with a calibration intercept closer to zero (i.e., less underestimation) and a calibration slope closer to 1 (i.e., less overfitting) compared with the standard model.
5 Decision-curve analysis
Finally, we will examine an alternative approach to evaluating model performance: decision curve analysis (DCA). DCA quantifies a model’s clinical usefulness by calculating a net benefit, which weights true positives against the harm of false positives based on a chosen threshold probability. This approach helps determine whether using the model improves decision-making compared to default strategies such as treating all or no patients. Read more about DCA here.
Based on guidance from our TBI experts, the relevant range of threshold probabilities is considered to be between 20% and 40%. Importantly, this range must be determined beforehand and should not be influenced by the DCA analysis. Read more about this here.
# Obtain predictions on validation data
validation$predictions <- preds(fit_standard, x = x_val, y = y_val, df = validation)
# Perform DCA analysis on validation data
1dca.plot <- dcurves::dca(y_val ~ predictions,
2 data = validation,
3 label = list(predictions = "Prediction Model"))- 1
- Specify the formula as the outcome on the predictions.
- 2
- DCA is most informative when applied to validation data, as it captures the model’s true clinical benefit on unseen cases. Using development data may overestimate benefit unless optimism is accounted for.
- 3
- Change the label of the model in the plot.
Assuming '1' is [Event] and '0' is [non-Event]
# Create table with net benefit values at certain thresholds
data.frame(
dca.plot$dca |>
filter(
threshold == 0.2 |
threshold == 0.3 |
threshold == 0.4
) |>
arrange(threshold) |>
select(label, threshold, net_benefit) |>
mutate(threshold = threshold * 100) |>
mutate_if(is.numeric, round, 2)
) label threshold net_benefit
1 Treat All 20 0.26
2 Treat None 20 0.00
3 Prediction Model 20 0.26
4 Treat All 30 0.15
5 Treat None 30 0.00
6 Prediction Model 30 0.19
7 Treat All 40 0.01
8 Treat None 40 0.00
9 Prediction Model 40 0.12
From the Table we observe that at a 30% threshold probability, the prediction model yields a net benefit of 0.19. This means that using the model is equivalent to correctly identifying 19 additional patients per 100 who will experience an unfavourable 6-month outcome, after accounting for the downsides of unnecessary interventions in low-risk patients. Consequently, the model offers greater clinical utility at this threshold than both treating all patients (net benefit 0.15) and treating none (net benefit 0).
6 Final words
We now discussed different shrinkage methods and have seen how to implement these. There are other methods available for shrinkage and these methods can also be applied to regression models other than logistic regression, but you are now familiar with the implementation of some of the most popular approaches.
7 Postscript
This practical was developed on:
R version 4.5.0 (2025-04-11 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)
Matrix products: default
LAPACK version 3.12.1
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Europe/Amsterdam
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] patchwork_1.3.1 ggplot2_3.5.2 knitr_1.50 pROC_1.19.0.1
[5] rms_8.0-0 Hmisc_5.2-3 glmnet_4.1-9 Matrix_1.7-3
[9] rio_1.2.4 tidyr_1.3.1 magrittr_2.0.3 dplyr_1.1.4
loaded via a namespace (and not attached):
[1] gtable_0.3.6 shape_1.4.6.1 xfun_0.52 htmlwidgets_1.6.4
[5] lattice_0.22-6 vctrs_0.6.5 tools_4.5.0 generics_0.1.4
[9] sandwich_3.1-1 tibble_3.2.1 cluster_2.1.8.1 pacman_0.5.1
[13] R.oo_1.27.1 pkgconfig_2.0.3 data.table_1.17.2 checkmate_2.3.2
[17] RColorBrewer_1.1-3 lifecycle_1.0.4 compiler_4.5.0 farver_2.1.2
[21] stringr_1.5.1 MatrixModels_0.5-4 codetools_0.2-20 SparseM_1.84-2
[25] dcurves_0.5.0 quantreg_6.1 htmltools_0.5.8.1 yaml_2.3.10
[29] htmlTable_2.4.3 Formula_1.2-5 pillar_1.10.2 R.utils_2.13.0
[33] MASS_7.3-65 iterators_1.0.14 rpart_4.1.24 multcomp_1.4-28
[37] foreach_1.5.2 nlme_3.1-168 tidyselect_1.2.1 digest_0.6.37
[41] mvtnorm_1.3-3 polspline_1.1.25 stringi_1.8.7 purrr_1.1.0
[45] labeling_0.4.3 splines_4.5.0 fastmap_1.2.0 grid_4.5.0
[49] colorspace_2.1-1 cli_3.6.5 base64enc_0.1-3 utf8_1.2.5
[53] survival_3.8-3 TH.data_1.1-3 withr_3.0.2 foreign_0.8-90
[57] scales_1.4.0 backports_1.5.0 rmarkdown_2.29 nnet_7.3-20
[61] gridExtra_2.3 R.methodsS3_1.8.2 zoo_1.8-14 evaluate_1.0.3
[65] mgcv_1.9-1 rlang_1.1.6 Rcpp_1.0.14 glue_1.8.0
[69] rstudioapi_0.17.1 jsonlite_2.0.0 R6_2.6.1