MAIN ANALYSIS
Step 1: Load Our Toolkit
First, we load our packages. We’ll use several R packages that are essential for survival analysis and visualisation:
survival: The core package for all survival analysis in R (creatingSurvobjects, runningsurvfitandcoxph).survminer: Makes beautifulggplot2-based survival plots.gtsummary: Creates publication-ready tables for our regression models.sjPlot: Excellent for plotting model results (like our final forest plot).ggplot2: For general plotting.
Step 2: The Data & The Concept of “Censoring”
In a real project, we’d probably load a CSV pulled from a software (perhaps via an API, or manual export from the likes of Catapult, Hudl etc).
Anyway, because we don’t have that, we’ll create a mock dataset for 100 players.
The two most important columns in any survival dataset are:
time: The duration of follow-up (ourdays_to_injury).status: The event outcome (1 = Event happened, 0 = Censored).
What is Censoring?
“Censoring” is the most important concept in survival analysis. A
status of 0 means the player did not have an
injury during our observation period.
This is crucial: not having an injury is different from being “healthy.”
A player is censored (status=0) if:
1.The season ended and they were injury-free (e.g., observed for 270 days).
2.They were sold in the transfer window (e.g., observed for 150 days).
3.They had a different event (like a contact injury) that took them out of the study.
Survival models are powerful because they correctly use the information from these 150- or 270-day “survivors,” whereas other models might throw it away.
library(survival)
library(dplyr)
set.seed(999) # New seed for variety
# --- 1. Setup the Squad (The Humans) ---
player_ids <- 1:25
positions <- c(rep("GK", 3), rep("DF", 8), rep("MF", 8), rep("FW", 6))
base_ages <- round(rnorm(25, mean = 22, sd = 3))
base_ages <- pmax(pmin(base_ages, 32), 17) # 17 to 32 start age
inherent_risk <- runif(25, 0.7, 1.3) # Injury History factor
squad_ref <- data.frame(
player_id = player_ids,
position = positions,
base_age = base_ages,
inherent_risk = inherent_risk
)
# --- 2. Expand to 4 Seasons (The Context) ---
sim_data <- expand.grid(player_id = player_ids, season = 1:4) %>%
left_join(squad_ref, by = "player_id") %>%
mutate(current_age = base_age + (season - 1))
# --- 3. Assign Roles & Exposure (The Realism) ---
# Logic: Players grow into "Key" roles between 24-29.
# Youngsters (<21) and Veterans (>31) are likely Rotation/Fringe.
assign_role <- function(age, pos, risk_factor) {
# Random noise
val <- runif(1)
if (age >= 24 & age <= 30) {
if(val > 0.3) return("Key Starter") else return("Rotation")
} else {
if(val > 0.6) return("Rotation") else return("Fringe")
}
}
sim_data$squad_role <- mapply(assign_role, sim_data$current_age, sim_data$position, sim_data$inherent_risk)
# Assign Expected Minutes based on Role (League + Continental Load)
# Key Starter: ~3000-4000 mins
# Rotation: ~1500-2500 mins
# Fringe: ~500-1000 mins
sim_data <- sim_data %>%
mutate(
expected_minutes = case_when(
squad_role == "Key Starter" ~ runif(n(), 3200, 4200),
squad_role == "Rotation" ~ runif(n(), 1500, 2800),
squad_role == "Fringe" ~ runif(n(), 200, 1200)
),
# Calculate "Intensity" (Minutes per day average)
daily_load = expected_minutes / 270
)
# --- 4. Simulate Survival (ADJUSTED FOR REALISM) ---
sim_data <- sim_data %>%
mutate(
# Risk Formula:
# 1. Reduced Age impact slightly (0.03 -> 0.02)
# 2. Reduced Daily Load impact significantly (0.15 -> 0.08)
# Starters will now have ~3x risk of Fringe, not 10x.
risk_score = exp(
(current_age * 0.02) +
(ifelse(position == "FW", 0.4, 0)) +
(daily_load * 0.08) +
(inherent_risk * 0.2)
)
)
# Increased base scale from 1100 -> 4500
# This counteracts the risk_score to ensure the average player lasts longer
sim_data$days_to_injury <- rweibull(n = nrow(sim_data), shape = 1.3, scale = 4500 / sim_data$risk_score)
# --- 5. Clean & Calculate Finals ---
season_length <- 270
sim_data <- sim_data %>%
mutate(
status = ifelse(days_to_injury <= season_length, 1, 0),
time = pmin(days_to_injury, season_length),
actual_minutes_played = round(daily_load * time)
) %>%
select(player_id, season, position, squad_role, current_age, expected_minutes, actual_minutes_played, time, status)
# Check the mix again - aim for Key Starter breakdown to be roughly 15 (Injured) / 13 (Safe)
table(sim_data$squad_role, sim_data$status)##
## 0 1
## Fringe 32 5
## Key Starter 18 10
## Rotation 31 4
## player_id season position squad_role
## Min. : 1 Min. :1.00 Length:100 Length:100
## 1st Qu.: 7 1st Qu.:1.75 Class :character Class :character
## Median :13 Median :2.50 Mode :character Mode :character
## Mean :13 Mean :2.50
## 3rd Qu.:19 3rd Qu.:3.25
## Max. :25 Max. :4.00
## current_age expected_minutes actual_minutes_played time
## Min. :17.0 Min. : 211.6 Min. : 158.0 Min. : 16.48
## 1st Qu.:20.0 1st Qu.: 894.8 1st Qu.: 868.8 1st Qu.:270.00
## Median :23.0 Median :1907.2 Median :1827.0 Median :270.00
## Mean :22.7 Mean :2027.9 Mean :1860.0 Mean :253.51
## 3rd Qu.:25.0 3rd Qu.:3237.5 3rd Qu.:2661.8 3rd Qu.:270.00
## Max. :29.0 Max. :4171.1 Max. :3994.0 Max. :270.00
## status
## Min. :0.00
## 1st Qu.:0.00
## Median :0.00
## Mean :0.19
## 3rd Qu.:0.00
## Max. :1.00
The sim_data data frame now contains our mock dataset
for survival analysis. We are simulating a longitudinal dataset of 25
players tracked over 4 seasons (N=100).
| Variable | Type | Description |
player_id |
Integer | Unique identifier for the athlete (1-25). |
season |
Integer | The season number (1-4). Note that players age as seasons progress. |
position |
Factor | The tactical role: Goalkeeper (GK), Defender (DF), Midfielder (MF), or Forward (FW). |
squad_role |
Factor | The player’s status in the hierarchy (Key Starter, Rotation, Fringe). This determines their expected match load. |
current_age |
Numeric | The player’s biological age at the start of the specific season. |
actual_minutes |
Numeric | Total match minutes accumulated over the season. Used descriptively in this tutorial. |
time |
Numeric | Duration. The number of days from the start of the season (Day 0) until the Event or Censoring. Max = 270 days. |
status |
Binary | The Event Indicator.
Note: Contact injuries (e.g., tackles, collisions) are treated as censored events because they are generally unrelated to the intrinsic or load-dependent risk factors (such as age and minutes played) modeled here. |
Step 3: Create the Survival Object
We combine time and status into a single Surv object.
Notice the + signs in the output below. These represent censored players (those who lasted the full 270 days).
# 3. CREATE THE SURVIVAL OBJECT -----------------------------
surv_obj <- Surv(time = sim_data$time, event = sim_data$status)
print("--- First 10 Survival Objects ---")## [1] "--- First 10 Survival Objects ---"
## [1] 218.5890 270.0000+ 270.0000+ 270.0000+ 270.0000+ 270.0000+ 270.0000+
## [8] 270.0000+ 270.0000+ 179.5455
Step 4: Kaplan-Meier Curve (The “What”)
The Kaplan-Meier (KM) estimator is a non-parametric method to estimate the survival function. In plain English, it plots the “Probability of Remaining Injury-Free” over time.
We will separate these plots to answer three specific questions:
4a. The “Price of Playing”: Risk by Squad Role
Question: Do “Key Starters” burn out faster than “Fringe” players?
# 4. Stratified Analysis by Squad Role
km_fit_role <- survfit(surv_obj ~ squad_role, data = sim_data)
# Plot the curves
ggsurvplot(
km_fit_role,
data = sim_data,
conf.int = TRUE, # <--- Adds the Confidence Intervals in shaded area
pval = TRUE, # Show Log-Rank p-value
risk.table = TRUE, # Show table of "Number at Risk" below
risk.table.height = 0.25,
palette = c("#2E9FDF", "#E7B800", "#FC4E07"),
xlim = c(0, 270),
break.time.by = 45, # Break x-axis every 1.5 months
title = "Injury-Free Probability by Squad Role",
xlab = "Days into Season",
legend.title = "Role",
legend.labs = c("Fringe", "Key Starter", "Rotation")
)How to read this graph:
The Drop: Notice how the Key Starter line (Yellow) drops significantly faster than the Fringe line (Blue). This gap represents the increased risk of injury accumulation.
The P-Value: A p-value < 0.05 would mean the difference between these lines is statistically significant.
4b. The Tactical View: Risk by Position
Question: Which position gets injured the most?
Interpretation: The lines here are closer together. If the p-value is high (>0.05), it means we cannot prove that one position is riskier than another based on this data alone. The “drops” might just be random noise.
# --- 4b. Stratified by Position (The Tactical View) ---
# Do forwards burn out faster than Goalkeepers?
km_fit_pos <- survfit(surv_obj ~ position, data = sim_data)
ggsurvplot(
km_fit_pos,
data = sim_data,
conf.int = TRUE, # <--- Adds the Confidence Intervals in shaded area
pval = TRUE,
risk.table = TRUE,
risk.table.height = 0.25,
palette = "jco", # "jco" is a nice academic color palette
xlim = c(0, 270),
title = "Injury-Free Probability by Position",
xlab = "Days into Season",
legend.title = "Position"
)4c. The Aging Curve: U23s vs Seniors
Question: Which age group of players are more likely to survive the full season?
# --- 4c. Stratified by Age (3-Tier) ---
# 1. Create 3 distinct age groups using case_when
sim_data <- sim_data %>%
mutate(
age_group = case_when(
current_age <= 23 ~ "U23 (Young)",
current_age >= 30 ~ "Veteran (30+)",
TRUE ~ "Senior (24-29)" # Captures everyone in the middle
)
)
# 2. Fit the model
km_fit_age <- survfit(surv_obj ~ age_group, data = sim_data)
# 3. Plot with 3 colors
ggsurvplot(
km_fit_age,
data = sim_data,
conf.int = TRUE, # <--- Adds the Confidence Intervals in shaded area
pval = TRUE,
risk.table = TRUE,
risk.table.height = 0.25,
# Palette: Blue (Young), Orange (Senior), Red (Veteran)
palette = c("#2E9FDF", "#E7B800", "#FC4E07"),
xlim = c(0, 270),
title = "Injury-Free Probability: Young vs Senior vs Veteran",
xlab = "Days into Season",
legend.title = "Age Group"
)Interpretation: Look at the end of the season (Day 200+). Does the Senior line (Blue) diverge away from the Young line (Yellow)? This visual divergence suggests that older players may struggle more with cumulative fatigue late in the season.There are also no veterans in this dataset.
Step 5: Cox Proportional Hazards Regression (Quantifying Risk)
The Kaplan-Meier plots are great, but they only look at one variable at a time. To understand the true drivers of injury, we need a Multivariate Model that controls for everything at once.
We use Cox Proportional Hazards (CPH). This model produces a Hazard Ratio (HR) for each factor.
We will build a multivariate model controlling for:
Age: Do older players break easier?
Position: Are Forwards riskier than GKs?
Squad Role: Is it the role that kills them?
Cluster: We account for the fact that Player #1 appears 4 times (repeated measures).
# 5. Multivariate Cox Model
fit_multi <- coxph(
surv_obj ~ current_age + position + squad_role + cluster(player_id),
data = sim_data
)
summary(fit_multi)## Call:
## coxph(formula = surv_obj ~ current_age + position + squad_role,
## data = sim_data, cluster = player_id)
##
## n= 100, number of events= 19
##
## coef exp(coef) se(coef) robust se z Pr(>|z|)
## current_age -0.05947 0.94227 0.12134 0.07018 -0.847 0.397
## positionFW 0.87891 2.40828 0.63429 0.57875 1.519 0.129
## positionGK 0.13262 1.14182 0.88792 1.20105 0.110 0.912
## positionMF 0.18637 1.20487 0.67361 0.69607 0.268 0.789
## squad_roleKey Starter 1.45808 4.29769 0.78306 0.62155 2.346 0.019 *
## squad_roleRotation -0.05706 0.94453 0.70008 0.77410 -0.074 0.941
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## current_age 0.9423 1.0613 0.8212 1.081
## positionFW 2.4083 0.4152 0.7746 7.488
## positionGK 1.1418 0.8758 0.1085 12.021
## positionMF 1.2049 0.8300 0.3079 4.715
## squad_roleKey Starter 4.2977 0.2327 1.2711 14.531
## squad_roleRotation 0.9445 1.0587 0.2072 4.307
##
## Concordance= 0.701 (se = 0.059 )
## Likelihood ratio test= 9.65 on 6 df, p=0.1
## Wald test = 12.73 on 6 df, p=0.05
## Score (logrank) test = 10.8 on 6 df, p=0.09, Robust = 5.63 p=0.5
##
## (Note: the likelihood ratio and score tests assume independence of
## observations within a cluster, the Wald and robust score tests do not).
Step 6: Reporting the Results
We use gtsummary and sjPlot to visualise
the Hazard Ratios.
6a. The Professional Table
This table provides the exact numbers for your report.
# 6. Create a professional table
tbl_regression(
fit_multi,
exponentiate = TRUE, # Convert Coef -> Hazard Ratios
label = list(
current_age ~ "Player Age",
position ~ "Position",
squad_role ~ "Squad Role"
)
)| Characteristic | HR | 95% CI | p-value |
|---|---|---|---|
| Player Age | 0.94 | 0.82, 1.08 | 0.4 |
| Position | |||
| DF | — | — | |
| FW | 2.41 | 0.77, 7.49 | 0.13 |
| GK | 1.14 | 0.11, 12.0 | >0.9 |
| MF | 1.20 | 0.31, 4.71 | 0.8 |
| Squad Role | |||
| Fringe | — | — | |
| Key Starter | 4.30 | 1.27, 14.5 | 0.019 |
| Rotation | 0.94 | 0.21, 4.31 | >0.9 |
| Abbreviations: CI = Confidence Interval, HR = Hazard Ratio | |||
6b. The Forest Plot (Visualising the Risk)
This is probably an effective way to communicate to the non-statistical staff/coaches.
# 7. Forest Plot
plot_model(
fit_multi,
show.values = TRUE,
exponentiate = TRUE,
title = "Hazard Ratios for Time-Loss Injury",
vline.color = "grey"
)How to read the Forest Plot:
- The Vertical Line (at 1.0): This is the “Line of No Effect.” If a result touches this line, it is not statistically significant.
- The Blue Dot: This is the Hazard Ratio.
1.0: Higher Risk (Right side).
< 1.0: Lower Risk / Protective (Left side).
- The Horizontal Line (Whiskers): This is the Confidence Interval. If this line crosses the vertical grey line, we cannot be sure of the result.
- Reference Groups: “Key Starters” are being compared to the baseline group (Fringe Players). “Forwards” are compared to Defenders.
Step 7: Conclusion: Interpreting the Data for the Coaches
Unlike simple counts, Survival Analysis allows us to separate “Bad Luck” from Systemic Risk.
Based on the simulated data, here is the actionable insight can be presented to the backroom team:
1. The “Cost” of Being a Starter
The model identifies Squad Role as a highly significant
predictor (p<0.05).
Key Starters have a Hazard Ratio (HR) significantly greater than 1.0 compared to Fringe players.
Translation: A Key Starter is 4.30 times more likely to suffer a
time-loss injury on any given day than a Fringe player. This model
treats squad_role as a proxy for exposure and suggests that
players in Key Starter roles have substantially higher injury hazard
than Fringe players.
2. Age & Position
Age: The model estimates a hazard ratio of 0.94 per additional year of age (95% CI 0.82–1.08, p = 0.4), suggesting no clear evidence that age meaningfully changes injury risk in this simulated, relatively young squad.
Given the young age profile (mean ≈ 22.7 years), age effects may only become apparent once older players (30+) are included.
Position: While Forwards may show a trend toward higher risk, the confidence intervals (whiskers in the Forest Plot) likely overlap with 1.0, suggesting that position alone is not the primary driver of risk…workload is.
RECOMMENDATION?
In this simulated data set, Key Starters show an
estimated hazard ratio of 4.30 (1.27–14.5) versus
Fringe players, indicating substantially higher risk under
higher exposure.
The medical team cannot “fix” a player’s age. However, this simulation illustrates how rotation can function as a medical intervention by reducing exposure‐related risk.
Reducing a “Key Starter” to a “Rotation” role for 2-3 weeks during the season effectively lowers their hazard rate and extends their survival probability significantly.
At this point, you may start to roll your eyes and say, “YA
DON’T SAY, CAPTAIN OBVIOUS!”
While it is intuitive that high workload increases risk, this analysis
quantifies exactly when and how that
risk becomes critical.
1. The “135-Day Wall”
Our Kaplan-Meier curves for Key Starters (Figure 1) show a relatively stable survival rate for the first 4 months. However, we identify a critical inflection point around Day 135 (Mid-Season).
Insight: Up until Day 135, the risk profile of a Starter is manageable. After Day 135, the survival probability crashes by nearly 40%.
Action: Rotation strategies shouldn’t be uniform. You can “ride” your starters for the first 3 months, but mandatory rotation cycles must kick in before Day 135 to avoid the statistical crash.
2. Quantifying the Gamble (The Hazard Ratio)
We found a Hazard Ratio of 4.30 for Key Starters.
Translation for the Coach: Every time you play a Key Starter in a “Fringe” game (e.g., early Cup round) instead of resting them, you are accepting an injury hazard consistent with what we estimate for Key Starters across the season, that is, approximately 4x higher than Fringe players in this model.
Decision Support: Is the match importance high enough to justify a 400% increase in injury probability? If not, the model suggests that rotation is the safer option from an injury risk‑management perspective.
Limitations & Future Directions
While this tutorial provides a solid foundation for survival analysis in football injury risk, several limitations must be acknowledged:
Age Profile: Our current squad is relatively young (average age ~22). As we recruit older players, we would expect “Age” to become a more significant risk factor, requiring a separate analysis for the 30+ cohort.
Training Load: This model uses Match Minutes as a proxy for load. Future iterations should incorporate GPS training load data to capture the “hidden” work done on the training pitch.
Also,
“Fringe/rotation/key starter” in
this tutorial are simulated abstractions tied to expected minutes. In a
real club setting they should be operationalised from objective exposure
data. Some might also be quick to point out that DF,
MF, FW, GK are broad positional
groupings that do not fully capture modern tactical roles and
responsibilities.
Further, applying it to a live high-performance environment requires addressing several real-world complexities that this simulation simplifies.
1. The “Mid-Season” Problem (Left Truncation)
Our current model assumes a static cohort where every player is present from Day 0 with no additions/new signings to the squad.
Real-World Scenario: A new striker is signed in the Mid Season Transfer Window (Day 150).
The Issue: If we simply add them to the dataset, the model might incorrectly assume they “survived” the first 150 days injury-free. In reality, they were not “at risk” in our environment during that period.
Future Solution: Advanced survival models handle this using Left Truncation (Delayed Entry). This ensures a player only contributes to the “At Risk” calculation during the specific window they are actually present at the club.
2. Departing Players (Informative Censoring)
We currently treat all censoring (end of season vs. transfer out) as “non-informative”, meaning the reason for leaving is unrelated to injury risk.
Real-World Scenario: A player is sold because they are injury-prone, or a player is released because they cannot handle the training load.
The Issue: If high-risk players are systematically removed from the study (sold), our remaining pool looks artificially healthier than it is.
Future Solution: Conducting a Competing Risks Analysis to distinguish between “Random Censoring” (e.g., End of Loan) and “Informative Censoring” (e.g., Medical Release).
3. Recurrent Injuries
Standard Survival Analysis usually stops at the “First Failure.” Once a player gets injured (status = 1), they drop out of the study.
Real-World Scenario: A winger strains their hamstring in August, returns in October, and strains it again in December.
The Issue: By ignoring the second and third injuries, we underestimate the total burden of injury on the squad. Furthermore, the first injury often increases the hazard rate for the second.
4. Training Load vs. Match Exposure
Our model currently uses squad_role as a proxy for exposure.
actual_minutes is available but is only
used descriptively; a full time‑varying implementation is left for
future work
Real-World Scenario: A “Fringe” player might play zero match minutes but be subjected to intense “top-up” conditioning sessions to maintain fitness.
The Issue: The model views this player as “low exposure,” but their physiological load (and tissue stress) might actually be high.
Future Solution: Integrating GPS / External Load data (Total Distance, High-Speed Running, Athlete Load) as the exposure metric rather than simple Match Minutes. This would provide a complete picture of the mechanical stressors placed on the athlete.