INTRODUCTION

Modeling Injury Risk: Beyond Binary Outcomes

In high-performance sports, asking “Will this player get injured this season?” is often the wrong question. A simple Yes or No prediction (binary classification) misses the most critical dimension of performance staff decision-making: Time.

Of course, knowing a player is “at risk” is helpful, let’s not disregard that.

But knowing when that risk is highest allows for targeted load management and intervention.

And this is where Survival Analysis comes in.

WHAT is Survival Analysis?

Survival Analysis is a branch of statistics designed to analyse time-to-event data. Instead of predicting a static outcome, it models the duration until an event of interest occurs.

While the name originates from medical research (modeling the time until death), in sports analytics, we simply reframe the parameters:

  • The Subject: The Athlete.

  • The Event: The Injury.

  • The Survival Function: The probability that a player remains healthy past a specific point in the season.

The Key Differentiator: Censoring

You might ask: Why not just use standard Linear Regression model to predict the number of days until an injury?

The answer lies in a specific data problem called Right Censoring.

In any given season, many players will not get injured.

If we simply drop these healthy players from our analysis, we lose valuable data (the fact that they “survived” high workloads).

If we treat them as if they got injured on the last day of the season, we bias our results significantly.

Survival Analysis handles this mathematically by distinguishing between two types of observation:

  1. Uncensored (Event): The player sustained an injury during the study window. We know exactly when it happened.
  2. Censored: The season ended (or the player was transferred) before an injury occurred. We know they were healthy at least until that point, but the “event” never happened within our view.

Goal of this Tutorial

By utilising Survival Analysis, we can move beyond simple frequency counts. We can model the Hazard Rate (i.e. instantaneous risk) to answer dynamic questions, such as:

How does the risk of injury accumulate as the season progresses?

Do older players succumb to injury faster than younger ones, or is the risk profile constant?

To ensure our analysis is contextually relevant to a single football club, we will not use random unrelated samples. Instead, we will analyze a 25-man squad tracked over 4 consecutive seasons.

This structure (N = 100 observations) allows us to account for the reality of club football: the same players returning year after year, but with changing risk profiles as they age.

This tutorial uses the survival package in R to model season-start risk factors (current age, playing position, and squad role as a proxy for exposure) to predict the time-to-injury profile for a squad.

What output are we expecting?

  • Kaplan-Meier Survival Curves to visualise injury-free probabilities over time.

  • Cox Proportional Hazards Regression to quantify the effect of each risk factor on injury

  • Professional tables and forest plots to communicate findings effectively to coaching staff.

Scope Note: I should have probably led with this, but this analysis specifically models NON-CONTACT SOFT TISSUE INJURIES. Traumatic contact injuries (e.g., broken bones from tackles) are treated as censored events, as they are considered random accidents rather than systemic failures of load management.

MAIN ANALYSIS

Step 1: Load Our Toolkit

First, we load our packages. We’ll use several R packages that are essential for survival analysis and visualisation:

  • survival: The core package for all survival analysis in R (creating Surv objects, running survfit and coxph).
  • survminer: Makes beautiful ggplot2-based survival plots.
  • gtsummary: Creates publication-ready tables for our regression models.
  • sjPlot: Excellent for plotting model results (like our final forest plot).
  • ggplot2: For general plotting.
# 1. LOAD REQUIRED PACKAGES ---------------------------------
library(tidyverse)
library(survival)
library(gtsummary)
library(sjPlot)
library(ggplot2)
library(survminer) # For ggsurvplot

Step 2: The Data & The Concept of “Censoring”

In a real project, we’d probably load a CSV pulled from a software (perhaps via an API, or manual export from the likes of Catapult, Hudl etc).

Anyway, because we don’t have that, we’ll create a mock dataset for 100 players.

The two most important columns in any survival dataset are:

  1. time: The duration of follow-up (our days_to_injury).
  2. status: The event outcome (1 = Event happened, 0 = Censored).

What is Censoring?

“Censoring” is the most important concept in survival analysis. A status of 0 means the player did not have an injury during our observation period.

This is crucial: not having an injury is different from being “healthy.”

The above illustrated diagram may help to visualise these concepts better.
The above illustrated diagram may help to visualise these concepts better.

A player is censored (status=0) if:

1.The season ended and they were injury-free (e.g., observed for 270 days).

2.They were sold in the transfer window (e.g., observed for 150 days).

3.They had a different event (like a contact injury) that took them out of the study.

Survival models are powerful because they correctly use the information from these 150- or 270-day “survivors,” whereas other models might throw it away.

library(survival)
library(dplyr)
set.seed(999) # New seed for variety

# --- 1. Setup the Squad (The Humans) ---
player_ids <- 1:25
positions <- c(rep("GK", 3), rep("DF", 8), rep("MF", 8), rep("FW", 6))
base_ages <- round(rnorm(25, mean = 22, sd = 3))
base_ages <- pmax(pmin(base_ages, 32), 17) # 17 to 32 start age
inherent_risk <- runif(25, 0.7, 1.3) # Injury History factor

squad_ref <- data.frame(
  player_id = player_ids, 
  position = positions, 
  base_age = base_ages, 
  inherent_risk = inherent_risk
)

# --- 2. Expand to 4 Seasons (The Context) ---
sim_data <- expand.grid(player_id = player_ids, season = 1:4) %>%
  left_join(squad_ref, by = "player_id") %>%
  mutate(current_age = base_age + (season - 1))

# --- 3. Assign Roles & Exposure (The Realism) ---

# Logic: Players grow into "Key" roles between 24-29. 
# Youngsters (<21) and Veterans (>31) are likely Rotation/Fringe.
assign_role <- function(age, pos, risk_factor) {
  # Random noise
  val <- runif(1)
  
  if (age >= 24 & age <= 30) {
    if(val > 0.3) return("Key Starter") else return("Rotation")
  } else {
    if(val > 0.6) return("Rotation") else return("Fringe")
  }
}

sim_data$squad_role <- mapply(assign_role, sim_data$current_age, sim_data$position, sim_data$inherent_risk)

# Assign Expected Minutes based on Role (League + Continental Load)
# Key Starter: ~3000-4000 mins
# Rotation: ~1500-2500 mins
# Fringe: ~500-1000 mins
sim_data <- sim_data %>%
  mutate(
    expected_minutes = case_when(
      squad_role == "Key Starter" ~ runif(n(), 3200, 4200),
      squad_role == "Rotation"    ~ runif(n(), 1500, 2800),
      squad_role == "Fringe"      ~ runif(n(), 200, 1200)
    ),
    # Calculate "Intensity" (Minutes per day average)
    daily_load = expected_minutes / 270
  )

# --- 4. Simulate Survival (ADJUSTED FOR REALISM) ---

sim_data <- sim_data %>%
  mutate(
    # Risk Formula:
    # 1. Reduced Age impact slightly (0.03 -> 0.02)
    # 2. Reduced Daily Load impact significantly (0.15 -> 0.08)
    #    Starters will now have ~3x risk of Fringe, not 10x.
    risk_score = exp(
      (current_age * 0.02) +             
      (ifelse(position == "FW", 0.4, 0)) +
      (daily_load * 0.08) +              
      (inherent_risk * 0.2)
    )
  )

# Increased base scale from 1100 -> 4500
# This counteracts the risk_score to ensure the average player lasts longer
sim_data$days_to_injury <- rweibull(n = nrow(sim_data), shape = 1.3, scale = 4500 / sim_data$risk_score)

# --- 5. Clean & Calculate Finals ---
season_length <- 270

sim_data <- sim_data %>%
  mutate(
    status = ifelse(days_to_injury <= season_length, 1, 0),
    time = pmin(days_to_injury, season_length),
    actual_minutes_played = round(daily_load * time)
  ) %>%
  select(player_id, season, position, squad_role, current_age, expected_minutes, actual_minutes_played, time, status)

# Check the mix again - aim for Key Starter breakdown to be roughly 15 (Injured) / 13 (Safe)
table(sim_data$squad_role, sim_data$status)
##              
##                0  1
##   Fringe      32  5
##   Key Starter 18 10
##   Rotation    31  4
summary(sim_data)
##    player_id      season       position          squad_role       
##  Min.   : 1   Min.   :1.00   Length:100         Length:100        
##  1st Qu.: 7   1st Qu.:1.75   Class :character   Class :character  
##  Median :13   Median :2.50   Mode  :character   Mode  :character  
##  Mean   :13   Mean   :2.50                                        
##  3rd Qu.:19   3rd Qu.:3.25                                        
##  Max.   :25   Max.   :4.00                                        
##   current_age   expected_minutes actual_minutes_played      time       
##  Min.   :17.0   Min.   : 211.6   Min.   : 158.0        Min.   : 16.48  
##  1st Qu.:20.0   1st Qu.: 894.8   1st Qu.: 868.8        1st Qu.:270.00  
##  Median :23.0   Median :1907.2   Median :1827.0        Median :270.00  
##  Mean   :22.7   Mean   :2027.9   Mean   :1860.0        Mean   :253.51  
##  3rd Qu.:25.0   3rd Qu.:3237.5   3rd Qu.:2661.8        3rd Qu.:270.00  
##  Max.   :29.0   Max.   :4171.1   Max.   :3994.0        Max.   :270.00  
##      status    
##  Min.   :0.00  
##  1st Qu.:0.00  
##  Median :0.00  
##  Mean   :0.19  
##  3rd Qu.:0.00  
##  Max.   :1.00

The sim_data data frame now contains our mock dataset for survival analysis. We are simulating a longitudinal dataset of 25 players tracked over 4 seasons (N=100).

Variable Type Description
player_id Integer Unique identifier for the athlete (1-25).
season Integer The season number (1-4). Note that players age as seasons progress.
position Factor The tactical role: Goalkeeper (GK), Defender (DF), Midfielder (MF), or Forward (FW).
squad_role Factor The player’s status in the hierarchy (Key Starter, Rotation, Fringe). This determines their expected match load.
current_age Numeric The player’s biological age at the start of the specific season.
actual_minutes Numeric Total match minutes accumulated over the season. Used descriptively in this tutorial.
time Numeric Duration. The number of days from the start of the season (Day 0) until the Event or Censoring. Max = 270 days.
status Binary

The Event Indicator.

1 = Significant non-contact injury occurred (e.g. muscle strain, ligament stress).

0 = Censored (Player finished the season healthy).

Note: Contact injuries (e.g., tackles, collisions) are treated as censored events because they are generally unrelated to the intrinsic or load-dependent risk factors (such as age and minutes played) modeled here.

Step 3: Create the Survival Object

We combine time and status into a single Surv object.

Notice the + signs in the output below. These represent censored players (those who lasted the full 270 days).

# 3. CREATE THE SURVIVAL OBJECT -----------------------------
surv_obj <- Surv(time = sim_data$time, event = sim_data$status)

print("--- First 10 Survival Objects ---")
## [1] "--- First 10 Survival Objects ---"
surv_obj[1:10]
##  [1] 218.5890  270.0000+ 270.0000+ 270.0000+ 270.0000+ 270.0000+ 270.0000+
##  [8] 270.0000+ 270.0000+ 179.5455

Step 4: Kaplan-Meier Curve (The “What”)

The Kaplan-Meier (KM) estimator is a non-parametric method to estimate the survival function. In plain English, it plots the “Probability of Remaining Injury-Free” over time.

We will separate these plots to answer three specific questions:

4a. The “Price of Playing”: Risk by Squad Role

Question: Do “Key Starters” burn out faster than “Fringe” players?

# 4. Stratified Analysis by Squad Role
km_fit_role <- survfit(surv_obj ~ squad_role, data = sim_data)

# Plot the curves
ggsurvplot(
  km_fit_role,
  data = sim_data,
  conf.int = TRUE,         # <--- Adds the Confidence Intervals in shaded area
  pval = TRUE,             # Show Log-Rank p-value
  risk.table = TRUE,       # Show table of "Number at Risk" below
  risk.table.height = 0.25,
  palette = c("#2E9FDF", "#E7B800", "#FC4E07"),
  xlim = c(0, 270),
  break.time.by = 45,      # Break x-axis every 1.5 months
  title = "Injury-Free Probability by Squad Role",
  xlab = "Days into Season",
  legend.title = "Role",
  legend.labs = c("Fringe", "Key Starter", "Rotation")
)

How to read this graph:

The Drop: Notice how the Key Starter line (Yellow) drops significantly faster than the Fringe line (Blue). This gap represents the increased risk of injury accumulation.

The P-Value: A p-value < 0.05 would mean the difference between these lines is statistically significant.

4b. The Tactical View: Risk by Position

Question: Which position gets injured the most?

Interpretation: The lines here are closer together. If the p-value is high (>0.05), it means we cannot prove that one position is riskier than another based on this data alone. The “drops” might just be random noise.

# --- 4b. Stratified by Position (The Tactical View) ---
# Do forwards burn out faster than Goalkeepers?

km_fit_pos <- survfit(surv_obj ~ position, data = sim_data)

ggsurvplot(
  km_fit_pos,
  data = sim_data,
  conf.int = TRUE,         # <--- Adds the Confidence Intervals in shaded area
  pval = TRUE,
  risk.table = TRUE,
  risk.table.height = 0.25,
  palette = "jco",         # "jco" is a nice academic color palette
  xlim = c(0, 270),
  title = "Injury-Free Probability by Position",
  xlab = "Days into Season",
  legend.title = "Position"
)

4c. The Aging Curve: U23s vs Seniors

Question: Which age group of players are more likely to survive the full season?

# --- 4c. Stratified by Age (3-Tier) ---

# 1. Create 3 distinct age groups using case_when
sim_data <- sim_data %>%
  mutate(
    age_group = case_when(
      current_age <= 23 ~ "U23 (Young)",
      current_age >= 30 ~ "Veteran (30+)",
      TRUE ~ "Senior (24-29)" # Captures everyone in the middle
    )
  )

# 2. Fit the model
km_fit_age <- survfit(surv_obj ~ age_group, data = sim_data)

# 3. Plot with 3 colors
ggsurvplot(
  km_fit_age,
  data = sim_data,
  conf.int = TRUE,         # <--- Adds the Confidence Intervals in shaded area
  pval = TRUE,
  risk.table = TRUE,
  risk.table.height = 0.25,
  # Palette: Blue (Young), Orange (Senior), Red (Veteran)
  palette = c("#2E9FDF", "#E7B800", "#FC4E07"), 
  xlim = c(0, 270),
  title = "Injury-Free Probability: Young vs Senior vs Veteran",
  xlab = "Days into Season",
  legend.title = "Age Group"
)

Interpretation: Look at the end of the season (Day 200+). Does the Senior line (Blue) diverge away from the Young line (Yellow)? This visual divergence suggests that older players may struggle more with cumulative fatigue late in the season.There are also no veterans in this dataset.

Step 5: Cox Proportional Hazards Regression (Quantifying Risk)

The Kaplan-Meier plots are great, but they only look at one variable at a time. To understand the true drivers of injury, we need a Multivariate Model that controls for everything at once.

We use Cox Proportional Hazards (CPH). This model produces a Hazard Ratio (HR) for each factor.

We will build a multivariate model controlling for:

  • Age: Do older players break easier?

  • Position: Are Forwards riskier than GKs?

  • Squad Role: Is it the role that kills them?

  • Cluster: We account for the fact that Player #1 appears 4 times (repeated measures).

# 5. Multivariate Cox Model
fit_multi <- coxph(
  surv_obj ~ current_age + position + squad_role + cluster(player_id),
  data = sim_data
)

summary(fit_multi)
## Call:
## coxph(formula = surv_obj ~ current_age + position + squad_role, 
##     data = sim_data, cluster = player_id)
## 
##   n= 100, number of events= 19 
## 
##                           coef exp(coef) se(coef) robust se      z Pr(>|z|)  
## current_age           -0.05947   0.94227  0.12134   0.07018 -0.847    0.397  
## positionFW             0.87891   2.40828  0.63429   0.57875  1.519    0.129  
## positionGK             0.13262   1.14182  0.88792   1.20105  0.110    0.912  
## positionMF             0.18637   1.20487  0.67361   0.69607  0.268    0.789  
## squad_roleKey Starter  1.45808   4.29769  0.78306   0.62155  2.346    0.019 *
## squad_roleRotation    -0.05706   0.94453  0.70008   0.77410 -0.074    0.941  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                       exp(coef) exp(-coef) lower .95 upper .95
## current_age              0.9423     1.0613    0.8212     1.081
## positionFW               2.4083     0.4152    0.7746     7.488
## positionGK               1.1418     0.8758    0.1085    12.021
## positionMF               1.2049     0.8300    0.3079     4.715
## squad_roleKey Starter    4.2977     0.2327    1.2711    14.531
## squad_roleRotation       0.9445     1.0587    0.2072     4.307
## 
## Concordance= 0.701  (se = 0.059 )
## Likelihood ratio test= 9.65  on 6 df,   p=0.1
## Wald test            = 12.73  on 6 df,   p=0.05
## Score (logrank) test = 10.8  on 6 df,   p=0.09,   Robust = 5.63  p=0.5
## 
##   (Note: the likelihood ratio and score tests assume independence of
##      observations within a cluster, the Wald and robust score tests do not).

Step 6: Reporting the Results

We use gtsummary and sjPlot to visualise the Hazard Ratios.

6a. The Professional Table

This table provides the exact numbers for your report.

# 6. Create a professional table
tbl_regression(
  fit_multi,
  exponentiate = TRUE, # Convert Coef -> Hazard Ratios
  label = list(
    current_age ~ "Player Age",
    position ~ "Position",
    squad_role ~ "Squad Role"
  )
)
Characteristic HR 95% CI p-value
Player Age 0.94 0.82, 1.08 0.4
Position


    DF
    FW 2.41 0.77, 7.49 0.13
    GK 1.14 0.11, 12.0 >0.9
    MF 1.20 0.31, 4.71 0.8
Squad Role


    Fringe
    Key Starter 4.30 1.27, 14.5 0.019
    Rotation 0.94 0.21, 4.31 >0.9
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

6b. The Forest Plot (Visualising the Risk)

This is probably an effective way to communicate to the non-statistical staff/coaches.

# 7. Forest Plot
plot_model(
  fit_multi,
  show.values = TRUE,
  exponentiate = TRUE,
  title = "Hazard Ratios for Time-Loss Injury",
  vline.color = "grey"
)

How to read the Forest Plot:

  1. The Vertical Line (at 1.0): This is the “Line of No Effect.” If a result touches this line, it is not statistically significant.
  2. The Blue Dot: This is the Hazard Ratio.

1.0: Higher Risk (Right side).

< 1.0: Lower Risk / Protective (Left side).

  1. The Horizontal Line (Whiskers): This is the Confidence Interval. If this line crosses the vertical grey line, we cannot be sure of the result.
  2. Reference Groups: “Key Starters” are being compared to the baseline group (Fringe Players). “Forwards” are compared to Defenders.

Step 7: Conclusion: Interpreting the Data for the Coaches

Unlike simple counts, Survival Analysis allows us to separate “Bad Luck” from Systemic Risk.

Based on the simulated data, here is the actionable insight can be presented to the backroom team:

1. The “Cost” of Being a Starter

The model identifies Squad Role as a highly significant predictor (p<0.05).

Key Starters have a Hazard Ratio (HR) significantly greater than 1.0 compared to Fringe players.

Translation: A Key Starter is 4.30 times more likely to suffer a time-loss injury on any given day than a Fringe player. This model treats squad_role as a proxy for exposure and suggests that players in Key Starter roles have substantially higher injury hazard than Fringe players.

2. Age & Position

Age: The model estimates a hazard ratio of 0.94 per additional year of age (95% CI 0.82–1.08, p = 0.4), suggesting no clear evidence that age meaningfully changes injury risk in this simulated, relatively young squad.

Given the young age profile (mean ≈ 22.7 years), age effects may only become apparent once older players (30+) are included.

Position: While Forwards may show a trend toward higher risk, the confidence intervals (whiskers in the Forest Plot) likely overlap with 1.0, suggesting that position alone is not the primary driver of risk…workload is.

RECOMMENDATION?

In this simulated data set, Key Starters show an estimated hazard ratio of 4.30 (1.27–14.5) versus Fringe players, indicating substantially higher risk under higher exposure.

The medical team cannot “fix” a player’s age. However, this simulation illustrates how rotation can function as a medical intervention by reducing exposure‐related risk.

Reducing a “Key Starter” to a “Rotation” role for 2-3 weeks during the season effectively lowers their hazard rate and extends their survival probability significantly.

At this point, you may start to roll your eyes and say, “YA DON’T SAY, CAPTAIN OBVIOUS!”

While it is intuitive that high workload increases risk, this analysis quantifies exactly when and how that risk becomes critical.

1. The “135-Day Wall”

Our Kaplan-Meier curves for Key Starters (Figure 1) show a relatively stable survival rate for the first 4 months. However, we identify a critical inflection point around Day 135 (Mid-Season).

  • Insight: Up until Day 135, the risk profile of a Starter is manageable. After Day 135, the survival probability crashes by nearly 40%.

  • Action: Rotation strategies shouldn’t be uniform. You can “ride” your starters for the first 3 months, but mandatory rotation cycles must kick in before Day 135 to avoid the statistical crash.

2. Quantifying the Gamble (The Hazard Ratio)

We found a Hazard Ratio of 4.30 for Key Starters.

  • Translation for the Coach: Every time you play a Key Starter in a “Fringe” game (e.g., early Cup round) instead of resting them, you are accepting an injury hazard consistent with what we estimate for Key Starters across the season, that is, approximately 4x higher than Fringe players in this model.

  • Decision Support: Is the match importance high enough to justify a 400% increase in injury probability? If not, the model suggests that rotation is the safer option from an injury risk‑management perspective.

Limitations & Future Directions

While this tutorial provides a solid foundation for survival analysis in football injury risk, several limitations must be acknowledged:

  • Age Profile: Our current squad is relatively young (average age ~22). As we recruit older players, we would expect “Age” to become a more significant risk factor, requiring a separate analysis for the 30+ cohort.

  • Training Load: This model uses Match Minutes as a proxy for load. Future iterations should incorporate GPS training load data to capture the “hidden” work done on the training pitch.

Also, “Fringe/rotation/key starter” in this tutorial are simulated abstractions tied to expected minutes. In a real club setting they should be operationalised from objective exposure data. Some might also be quick to point out that DF, MF, FW, GK are broad positional groupings that do not fully capture modern tactical roles and responsibilities.

Further, applying it to a live high-performance environment requires addressing several real-world complexities that this simulation simplifies.

1. The “Mid-Season” Problem (Left Truncation)

Our current model assumes a static cohort where every player is present from Day 0 with no additions/new signings to the squad.

Real-World Scenario: A new striker is signed in the Mid Season Transfer Window (Day 150).

The Issue: If we simply add them to the dataset, the model might incorrectly assume they “survived” the first 150 days injury-free. In reality, they were not “at risk” in our environment during that period.

Future Solution: Advanced survival models handle this using Left Truncation (Delayed Entry). This ensures a player only contributes to the “At Risk” calculation during the specific window they are actually present at the club.

2. Departing Players (Informative Censoring)

We currently treat all censoring (end of season vs. transfer out) as “non-informative”, meaning the reason for leaving is unrelated to injury risk.

Real-World Scenario: A player is sold because they are injury-prone, or a player is released because they cannot handle the training load.

The Issue: If high-risk players are systematically removed from the study (sold), our remaining pool looks artificially healthier than it is.

Future Solution: Conducting a Competing Risks Analysis to distinguish between “Random Censoring” (e.g., End of Loan) and “Informative Censoring” (e.g., Medical Release).

3. Recurrent Injuries

Standard Survival Analysis usually stops at the “First Failure.” Once a player gets injured (status = 1), they drop out of the study.

Real-World Scenario: A winger strains their hamstring in August, returns in October, and strains it again in December.

The Issue: By ignoring the second and third injuries, we underestimate the total burden of injury on the squad. Furthermore, the first injury often increases the hazard rate for the second.

4. Training Load vs. Match Exposure

Our model currently uses squad_role as a proxy for exposure. actual_minutes is available but is only used descriptively; a full time‑varying implementation is left for future work

Real-World Scenario: A “Fringe” player might play zero match minutes but be subjected to intense “top-up” conditioning sessions to maintain fitness.

The Issue: The model views this player as “low exposure,” but their physiological load (and tissue stress) might actually be high.

Future Solution: Integrating GPS / External Load data (Total Distance, High-Speed Running, Athlete Load) as the exposure metric rather than simple Match Minutes. This would provide a complete picture of the mechanical stressors placed on the athlete.