1. Introduction: The Challenge of Data-Driven Injury Prevention in Collegiate Athletics

Sports-related injuries represent a critical and costly challenge in collegiate athletics, compromising athlete health, diminishing team performance, and impacting long-term career sustainability. Moving from reactive treatment to proactive prevention requires accurately identifying which athletes are at the highest risk before an incident occurs.

The Complexity of Injury Risk and the “Web of Determinants”

While advances in wearable sensors and athlete monitoring systems have enabled the continuous quantification of physiological and behavioral metrics, translating this data into actionable prevention strategies remains complex. Traditional monitoring often relies on isolated metrics or linear associations, such as a single workload threshold. However, contemporary research suggests these methods are insufficient for capturing the full “web of determinants” that leads to injury, where risk emerges from non-linear, multivariate interactions between training load, recovery status, and physiological capacity. This necessitates a systematic, data-driven approach.

The Variables and Their Non-Linear Correlation with Injury

This project focuses on identifying complex patterns within a collegiate athlete dataset by analyzing variables grouped into three critical categories. The primary challenge for any predictive model, including the one developed here, is to detect these subtle, compounding risk patterns:

A. Workload Metrics (The Primary Drivers of Risk)

Workload variables are central to injury prediction, but their correlation is paradoxical:

Training Load, Endurance Hours, and Step Count: The relationship is non-linear, governed by the Training-Injury Prevention Paradox. The main risk factor is not high training volume itself, but the rate of load change. High chronic workloads (consistent, long-term high training) are often protective, acting like a “vaccine” that builds physical resilience. Conversely, rapid, acute spikes in load compared to the athlete’s chronic average are the main trigger for soft-tissue injuries, as the body cannot adapt quickly enough.

Risky Moves: This variable is expected to show a positive correlation with risk. A higher count suggests repeated exposure to compromised movement quality or high biomechanical stress, leading to cumulative microtrauma and structural failure.

B. Physiological Readiness (Markers of Internal Stress)

These variables measure the body’s capacity to handle demands and its internal stress response:

Heart Rate Variability (HRV): This generally shows a negative correlation (lower HRV = higher risk). Low HRV indicates the autonomic nervous system is dominated by the sympathetic (stress) branch, signaling high accumulated physiological stress, poor recovery, or maladaptation to training.

VO2 Max: This is expected to show a negative correlation (higher VO2 Max = lower risk). Higher cardiorespiratory fitness typically implies greater physical resilience and better capacity to manage high training loads without incurring excessive strain or fatigue-related movement errors.

Age: The correlation is often non-linear in the athletic population, with risk potentially elevated at the extremes of the collegiate range due to ongoing development or reduced tissue elasticity.

C. Behavioral & Recovery Metrics (The Mediating Factors)

These variables dictate the body’s ability to repair and adapt, and are critical for mitigating risk:

Sleep Hours & Recovery Time: Both show a negative correlation (lower values = higher risk). Adequate sleep and recovery time are vital for physiological repair and neurocognitive recovery. Chronic deficits impair motor control, reduce reaction time, and compound residual fatigue, directly increasing the risk of sustaining a fatigue-related injury.

Feedback Level: The correlation is complex and indirect. As a proxy for the psychological environment, very high or very low feedback may point to an overly stressful or neglectful environment, which contributes to psychological stress—a known indirect risk factor for injury.

Project Goals and Machine Learning Methodology

This assignment is designed to systematically address the challenges presented by this complex data through a two-fold approach:

Systematic Data Exploration: The primary goal is to systematically explore the provided collegiate athlete biometric dataset to identify meaningful, albeit often subtle, relationships between the key variables and the ultimate outcome: injury history. This initial exploration is crucial for understanding the data separation and overlap—or lack thereof—between injured and uninjured groups, which informs the expectations for subsequent predictive modeling.

Predictive Modeling and Application Development: The second goal is to leverage a Naive Bayes classification model to quantify the probabilistic risk of injury based on the input features. This model serves as the computational core for a fully functional, interactive Shiny application. This application provides external users (coaches, athletes, or researchers) a platform for intuitive data exploration and the practical application of the machine learning results, allowing them to input new data and receive a probabilistic injury risk score.

The Role and Mixed Success of Machine Learning

By employing the Naive Bayes model, this project attempts to detect the required nuanced multivariate patterns (e.g., flagging an athlete who exhibits low HRV AND low sleep hours AND a high acute training load) that traditional univariate analysis often misses. However, the successful application of machine learning in sports injury prediction is often constrained by real-world data limitations: datasets are typically small, noisy, and contain significant overlap between outcome classes. Therefore, while ML provides a powerful framework, the derived probabilities must be approached with caution. The model and its predictions are intended as a proof-of-concept and a platform for further analysis, not a definitive diagnostic tool. The predictive strength is limited by the current dataset, underscoring that the app’s greatest utility lies in its framework and ability to facilitate intuitive data exploration.

2. Load Libraries and Data

To establish the analytical framework, several R packages were loaded to support data manipulation, visualization, modeling, and evaluation. The tidyverse suite was used for data import, cleaning, and manipulation, as well as for foundational visualization. Exploratory data analysis was supported by ggplot2, GGally, pheatmap, and corrplot, enabling visualization of distributions, pairwise relationships, heatmaps, and correlation structures.

Supervised machine learning was implemented using caret for model training and evaluation workflows, randomForest for ensemble-based classification, and naivebayes for probabilistic modeling. Model performance was assessed using receiver operating characteristic (ROC) analysis via the pROC package.

Unsupervised learning was supported by the cluster, mclust, and factoextra packages, which enabled hierarchical, distance-based, and model-based clustering as well as visualization of principal component and clustering results. The rsample package was used for data splitting and resampling procedures, and broom was used to tidy and organize model outputs for downstream analysis and reporting.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(pheatmap)
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
library(GGally)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(rsample)
## 
## Attaching package: 'rsample'
## 
## The following object is masked from 'package:caret':
## 
##     calibration
library(cluster)
library(mclust)
## Package 'mclust' version 6.1.2
## Type 'citation("mclust")' for citing this R package in publications.
## 
## Attaching package: 'mclust'
## 
## The following object is masked from 'package:dplyr':
## 
##     count
## 
## The following object is masked from 'package:purrr':
## 
##     map
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(naivebayes)
## naivebayes 1.0.0 loaded
## For more information please visit: 
## https://majkamichal.github.io/naivebayes/
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(broom)

athletes <- read_csv("College_Sports_Dataset.csv")
## Rows: 1050 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): Athlete_ID, Gender, Sport
## dbl (12): Age, HR_Variability, VO2_Max, Speed_Index, Endurance_Hours, Risky_...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Required R packages were loaded to support data manipulation, visualization, and modeling. The read_csv() function imported the dataset containing 1,050 observations and 15 variables. The initial column specification confirmed that the data includes a mix of character types (e.g., Athlete_ID, Gender) and double-precision numeric types (e.g., HR_Variability, Training_Load), indicating that the raw data structure was successfully preserved during import.

3. Data Wrangling and Preprocessing

Prior to exploratory analysis and predictive modeling, the dataset must be structured to reflect the measurement scale and analytical role of each variable. Inconsistent encoding or inappropriate data types can violate model assumptions, bias results, or produce silent computational errors. This section outlines the preprocessing steps used to prepare the athlete biometric dataset for downstream statistical analysis and machine learning by enforcing correct categorical encoding, numeric representation, and structural integrity.

3.1 Factor Conversion

Categorical variables were explicitly converted to factors to ensure appropriate handling by classification algorithms. Using mutate() from the dplyr package, Gender, Sport, and Feedback_Level were recoded as factors. The outcome variable, Injury_History, was converted from a binary numeric indicator (0/1) to a labeled factor with levels “Uninjured” and “Injured”, ensuring that modeling functions treat the outcome as a categorical class rather than a numeric quantity.

athletes <- athletes %>%
mutate(
Gender = factor(Gender),
Sport = factor(Sport),
Feedback_Level = factor(Feedback_Level),
Injury_History = factor(
Injury_History,
levels = c(0, 1),
labels = c("Uninjured", "Injured")
)
)

This encoding standardizes the treatment of categorical predictors and establishes Injury_History as a discrete classification outcome across all supervised learning models used in this study.

3.2 Numeric Variable Definition

Continuous predictors were explicitly identified and coerced to numeric format to ensure compatibility with analyses that assume numeric input, including correlation analysis, principal component analysis (PCA), clustering, and distance-based methods. A centralized vector containing the names of biometric, behavioral, and workload-related variables was defined to control which predictors were included in these analyses.

numeric_vars <- c(
"Age", "HR_Variability", "VO2_Max", "Speed_Index",
"Endurance_Hours", "Risky_Moves", "Sleep_Hours",
"Step_Count", "Training_Load", "Recovery_Time"
)

athletes[numeric_vars] <- lapply(athletes[numeric_vars], as.numeric)

Explicitly enforcing numeric data types prevents unintended coercion during matrix operations and model fitting, while the use of a centralized variable list improves reproducibility and consistency across subsequent analytical workflows.

3.3 Structural Checks

Following preprocessing, structural validation was performed to confirm that the dataset conforms to analytical expectations and that no unintended transformations occurred during recoding. Dataset structure and variable types were inspected using glimpse(), and the distribution of the injury outcome was summarized using frequency and proportional counts.

glimpse(athletes)
## Rows: 1,050
## Columns: 15
## $ Athlete_ID      <chr> "A0001", "A0002", "A0003", "A0004", "A0005", "A0006", …
## $ Age             <dbl> 24, 21, 22, 24, 20, 22, 22, 24, 19, 20, 24, 20, 20, 22…
## $ Gender          <fct> M, F, M, M, M, F, F, F, F, M, F, F, M, M, M, F, M, M, …
## $ Sport           <fct> Tennis, Volleyball, Swimming, Basketball, Basketball, …
## $ HR_Variability  <dbl> 51.39, 51.20, 69.62, 59.47, 52.19, 56.29, 49.89, 59.33…
## $ VO2_Max         <dbl> 38.07, 52.25, 40.50, 41.16, 48.19, 43.57, 46.54, 34.22…
## $ Speed_Index     <dbl> 6.71, 5.47, 5.61, 7.88, 6.14, 5.95, 5.16, 7.35, 5.44, …
## $ Endurance_Hours <dbl> 4.18, 3.19, 4.77, 4.29, 3.09, 2.72, 3.22, 4.67, 3.76, …
## $ Risky_Moves     <dbl> 1, 3, 1, 4, 2, 3, 2, 2, 1, 1, 1, 4, 1, 2, 1, 2, 3, 2, …
## $ Sleep_Hours     <dbl> 7.8, 6.1, 6.4, 7.9, 6.3, 6.8, 6.0, 8.8, 7.5, 6.7, 6.6,…
## $ Step_Count      <dbl> 14229, 9239, 14566, 11391, 12213, 11997, 11006, 11039,…
## $ Injury_History  <fct> Uninjured, Injured, Uninjured, Uninjured, Injured, Inj…
## $ Training_Load   <dbl> 392, 361, 250, 202, 327, 318, 257, 324, 399, 312, 254,…
## $ Recovery_Time   <dbl> 30.1, 21.0, 24.3, 25.9, 16.6, 22.6, 30.6, 29.4, 22.5, …
## $ Feedback_Level  <fct> 5, 4, 1, 3, 2, 4, 4, 5, 4, 5, 5, 3, 1, 4, 1, 1, 1, 5, …
injury_tab <- table(athletes$Injury_History)
injury_tab
## 
## Uninjured   Injured 
##       743       307
prop.table(injury_tab)
## 
## Uninjured   Injured 
##  0.707619  0.292381

The inspection confirms that the dataset contains 1,050 collegiate athletes across 15 variables, with categorical fields correctly encoded as factors and biometric and workload-related variables stored as numeric values. The injury outcome distribution includes 743 uninjured athletes (70.8%) and 307 injured athletes (29.2%), indicating a moderate class imbalance.

This imbalance has direct implications for model evaluation, as accuracy-based metrics may be biased toward the majority class. Consequently, subsequent analyses emphasize performance measures that better capture discrimination and minority-class performance, including sensitivity, specificity, balanced accuracy, and ROC–AUC.

Overall, these preprocessing and validation steps establish a consistent and reliable analytical foundation for the exploratory and predictive analyses that follow.

4. Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) was conducted to examine the structure, distributions, and relationships within the collegiate athlete biometric dataset prior to formal modeling. The primary goals were to verify data quality and physiological plausibility, and to assess whether injury status exhibits discernible univariate or low-dimensional patterns that could support straightforward classification. Given the multifactorial nature of sports-related injuries, EDA also serves to determine whether more complex multivariate or nonlinear approaches are warranted.

4.1 Summary Tables

Summary statistics were generated to obtain a high-level overview of variable distributions and to identify potential data quality issues, including implausible values or preprocessing errors. Numeric variables were summarized using standard descriptive measures, while categorical variables were summarized by frequency.

summary(athletes[numeric_vars])
##       Age        HR_Variability     VO2_Max       Speed_Index    
##  Min.   :18.00   Min.   :30.98   Min.   :28.80   Min.   : 3.560  
##  1st Qu.:19.00   1st Qu.:53.88   1st Qu.:41.65   1st Qu.: 5.870  
##  Median :21.00   Median :60.84   Median :45.02   Median : 6.500  
##  Mean   :20.95   Mean   :60.61   Mean   :44.97   Mean   : 6.519  
##  3rd Qu.:23.00   3rd Qu.:67.46   3rd Qu.:48.36   3rd Qu.: 7.168  
##  Max.   :24.00   Max.   :92.60   Max.   :60.61   Max.   :10.160  
##  Endurance_Hours  Risky_Moves    Sleep_Hours       Step_Count   
##  Min.   :1.030   Min.   :0.00   Min.   : 3.800   Min.   : 8009  
##  1st Qu.:3.430   1st Qu.:1.00   1st Qu.: 6.200   1st Qu.: 9844  
##  Median :4.020   Median :2.00   Median : 7.000   Median :11587  
##  Mean   :3.992   Mean   :1.99   Mean   : 6.952   Mean   :11585  
##  3rd Qu.:4.550   3rd Qu.:3.00   3rd Qu.: 7.600   3rd Qu.:13446  
##  Max.   :6.270   Max.   :7.00   Max.   :10.200   Max.   :14989  
##  Training_Load   Recovery_Time  
##  Min.   :200.0   Min.   : 9.90  
##  1st Qu.:249.0   1st Qu.:20.70  
##  Median :300.5   Median :24.00  
##  Mean   :300.2   Mean   :24.09  
##  3rd Qu.:351.8   3rd Qu.:27.50  
##  Max.   :399.0   Max.   :41.10
summary(dplyr::select(athletes, Gender, Sport, Injury_History, Feedback_Level))
##  Gender         Sport       Injury_History Feedback_Level
##  F:499   Athletics :191   Uninjured:743    1:237         
##  M:551   Basketball:182   Injured  :307    2:217         
##          Soccer    :166                    3:197         
##          Swimming  :188                    4:209         
##          Tennis    :172                    5:190         
##          Volleyball:151

The numeric summaries indicate physiologically realistic and internally consistent ranges across biometric and workload-related variables. Measures such as VO₂ max, sleep duration, training load, and step count fall within expected ranges for collegiate athletes, and no extreme anomalies are observed. Categorical summaries confirm representation across genders and sports, with injury history reflecting a higher proportion of uninjured athletes, consistent with real-world athletic populations.

4.2 Correlation Matrix

Linear associations among numeric predictors were examined to assess shared structure and potential multicollinearity. A Pearson correlation matrix was computed and visualized to facilitate identification of correlated variable groups.

corr_mat <- cor(athletes[numeric_vars], use = "pairwise.complete.obs")

corrplot(
corr_mat,
method = "color",
type = "upper",
tl.cex = 0.8,
addCoef.col = NA
)

The correlation matrix reveals moderate positive associations among workload-related variables such as training load, step count, and endurance hours, as well as among aerobic fitness indicators including VO₂ max and speed index. No correlations approach thresholds typically associated with problematic multicollinearity, indicating that predictors are related but not redundant and can be jointly considered in multivariable analyses.

4.3 Pairwise Relationships

Pairwise relationships among key predictors were examined using matrix-style visualization to assess whether injury status is associated with separable patterns across variable combinations.

ggally_vars <- athletes %>%
dplyr::select(
HR_Variability, VO2_Max, Speed_Index,
Training_Load, Recovery_Time, Sleep_Hours,
Risky_Moves, Injury_History
)

ggpairs(
ggally_vars,
aes(color = Injury_History, alpha = 0.5)
) + theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

Across most variable pairs, injured and uninjured athletes exhibit substantial overlap, indicating limited separability based on pairwise relationships alone. Correlation patterns observed here are consistent with those identified in the correlation matrix. While injured athletes appear slightly more concentrated at higher training loads and risky movement counts, particularly in the context of shorter recovery times, these trends are subtle and do not define clear boundaries between groups.

4.4 Density Plots by Injury Status

To examine group-level distributional differences, density plots were generated for selected workload, recovery, and physiological variables.

density_vars <- c(
"Training_Load", "Sleep_Hours",
"HR_Variability", "Recovery_Time",
"VO2_Max", "Risky_Moves"
)

for (v in density_vars) {
print(
ggplot(athletes, aes(x = .data[[v]], fill = Injury_History)) +
geom_density(alpha = 0.4) +
theme_minimal() +
labs(title = paste("Density of", v, "by Injury Status"))
)
}

Across all variables, density curves for injured and uninjured athletes overlap substantially, indicating weak univariate discrimination. Modest shifts are observed for training load, risky movements, and recovery time, suggesting these variables may contribute to injury risk in combination rather than independently.

4.5 Boxplots by Injury Status

Boxplots were used to complement density plots by highlighting group medians, variability, and potential outliers.

for (v in density_vars) {
print(
ggplot(athletes, aes(x = Injury_History, y = .data[[v]], fill = Injury_History)) +
geom_boxplot(alpha = 0.7) +
theme_minimal() +
labs(title = paste("Boxplot of", v, "by Injury Status"), x = "")
)
}

These visualizations confirm that injured athletes tend to exhibit slightly higher central values for workload and risky movement metrics, though substantial overlap persists across groups. No evidence of strong threshold effects or extreme outliers is observed.

4.6 Injury Distribution

Class balance was examined to inform subsequent model evaluation strategies.

ggplot(athletes, aes(x = Injury_History, fill = Injury_History)) +
geom_bar() +
theme_minimal() +
labs(title = "Class Distribution: Injury vs No Injury")

The distribution indicates a larger uninjured class relative to injured athletes, reflecting moderate class imbalance. This motivates the use of evaluation metrics that emphasize minority-class performance rather than overall accuracy alone.

4.7 Injury by Sport

Injury prevalence was examined across sports to assess whether sport-specific differences are evident.

injury_by_sport <- athletes %>%
dplyr::count(Sport, Injury_History) %>%
group_by(Sport) %>%
mutate(prop = n / sum(n))

ggplot(injury_by_sport, aes(x = Sport, y = prop, fill = Injury_History)) +
geom_col(position = "dodge") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

While injury proportions vary across sports, no sport exhibits extreme divergence. This suggests that sport alone is insufficient as a predictor and should be interpreted in conjunction with workload, recovery, and biometric variables.

4.8 EDA Synthesis

Collectively, the exploratory analyses demonstrate that injury risk in collegiate athletes is not driven by any single variable or simple low-dimensional pattern. Instead, risk appears to arise from gradual shifts across multiple domains, with substantial overlap between injured and uninjured groups. These findings strongly motivate the use of multivariate, probabilistic, and potentially nonlinear modeling approaches in subsequent analyses.

5. Dimensionality Reduction

Dimensionality reduction techniques were applied to assess whether the high-dimensional biometric dataset can be meaningfully summarized by a smaller number of latent structures. Many biometric and training variables reflect overlapping physiological processes, including aerobic capacity, workload accumulation, and recovery. These methods are used here strictly for exploratory diagnosis rather than prediction, with the goal of determining whether injury status aligns with dominant low-dimensional structure or remains distributed across multiple interacting dimensions.

5.1 Principal Component Analysis

Principal Component Analysis (PCA) was used to examine whether variance in the numeric biometric dataset concentrates along a small number of orthogonal axes. By transforming correlated predictors into uncorrelated components, PCA allows evaluation of whether injury risk aligns with dominant physiological dimensions rather than with individual variables.

PCA was performed using the prcomp() function with centering and scaling to account for heterogeneous measurement units. Variance explained by each component was examined using the summary() output.

pca_vars <- athletes %>% dplyr::select(all_of(numeric_vars))

# Preserve athlete identity for PCA and clustering
pca_vars <- athletes %>%
  dplyr::select(all_of(numeric_vars))

rownames(pca_vars) <- athletes$Athlete_ID
## Warning: Setting row names on a tibble is deprecated.
ath_pca <- prcomp(pca_vars, center = TRUE, scale. = TRUE)
summary(ath_pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4    PC5     PC6     PC7
## Standard deviation     1.0736 1.0590 1.0331 1.0203 1.0145 0.99321 0.98433
## Proportion of Variance 0.1153 0.1121 0.1067 0.1041 0.1029 0.09865 0.09689
## Cumulative Proportion  0.1153 0.2274 0.3341 0.4382 0.5412 0.63980 0.73669
##                            PC8     PC9    PC10
## Standard deviation     0.95790 0.93557 0.91664
## Proportion of Variance 0.09176 0.08753 0.08402
## Cumulative Proportion  0.82845 0.91598 1.00000

Variance is distributed relatively evenly across components. The first principal component explains approximately 11.5% of total variance, with subsequent components each contributing comparable proportions. Approximately 54% of variance is captured by the first five components, and over 90% only after nine components. This pattern indicates that no single latent factor dominates variability and that the dataset retains substantial effective dimensionality.

5.1.1 Scree Plot

The scree plot was used to visually assess whether a small subset of components captures a disproportionate share of variance.

fviz_eig(ath_pca, addlabels = TRUE)
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.

The plot shows a smooth decline in explained variance without a pronounced elbow, indicating that aggressive dimensionality reduction would result in substantial information loss. PCA therefore functions here as a diagnostic tool rather than a compression strategy for modeling.

5.1.2 PCA Biplot

The PCA biplot was used to visualize both variable loadings and athlete projections in reduced-dimensional space, allowing assessment of whether injured and uninjured athletes occupy distinct regions along dominant axes.

fviz_pca_biplot(
ath_pca,
geom.ind = "point",
habillage = athletes$Injury_History,
addEllipses = TRUE,
label = "var",
repel = TRUE
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Fitness-related variables (e.g., VO₂ max, endurance hours) load strongly along one principal axis, while workload and behavioral variables (e.g., training load, risky movements) load along another, with recovery measures projecting in opposing directions. Injured and uninjured athletes exhibit substantial overlap, although injured athletes tend to project modestly toward higher training stress and risk-associated directions. This pattern suggests gradual risk gradients rather than discrete separation in linear component space.

The biplot shows that aerobic fitness variables load strongly along one principal axis, whereas training stress and behavioral variables load along another. Recovery-related measures project in opposing directions, reflecting physiological trade-offs between workload accumulation and recovery capacity.

Injured and uninjured athletes exhibit substantial overlap in PCA space, and confidence ellipses largely intersect. However, injured athletes tend to project modestly toward directions associated with higher training stress and riskier movement profiles.

This visualization reinforces earlier EDA findings: injury risk does not correspond to a distinct linear combination of biometric variables but appears to increase gradually along axes associated with cumulative stress and insufficient recovery. These patterns support modeling injury risk as a probabilistic outcome rather than a discrete category.

5.2 Hierarchical Clustering

Hierarchical clustering is used to explore whether athletes naturally group into biometric subprofiles based solely on physiological, behavioral, and workload-related variables, without incorporating injury status. Unlike supervised models, clustering can reveal latent structure or athlete “profiles” that may correspond to different risk states. Hierarchical clustering was used to explore whether athletes naturally group into biometric profiles based on physiological, behavioral, and workload variables alone, without reference to injury status. A three-cluster solution was examined to reflect biologically plausible risk strata rather than binary injury classification.

All numeric variables were scaled prior to distance computation. Euclidean distance and Ward’s linkage were used to favor compact clusters.

pca_scaled <- scale(pca_vars)
dist_mat <- dist(pca_scaled)
hc <- hclust(dist_mat, method = "ward.D2")

plot(hc, labels = FALSE)
rect.hclust(hc, k = 3, border = "red")

The dendrogram reveals gradual branching without sharply separated clusters. Cutting the tree at k = 3 yields 1 very small cluster, one intermediate one, and one very large cluster, indicating that a group dominates the data (uninjured as we have shown before) but that there might be an intermediate group of uninjured but with injured characteristics that may be at risk of injury in the future.

5.2.1 Heatmap-Based Clustering Visualization

A heatmap was used to link clustering structure to underlying biometric patterns and to examine alignment with injury status.

# Build annotation using PCA row order
anno_athletes <- athletes %>%
  dplyr::select(Athlete_ID, Injury_History) %>%
  tibble::column_to_rownames("Athlete_ID") %>%
  .[rownames(pca_scaled), , drop = FALSE]

# Sanity check (should be TRUE)
identical(rownames(anno_athletes), rownames(pca_scaled))
## [1] TRUE
# Heatmap
pheatmap(
  mat = t(pca_scaled),
  annotation_col = anno_athletes,
  show_colnames = FALSE,
  clustering_method = "ward.D2",
  clustering_distance_cols = "euclidean",
  fontsize = 6,
  main = "Hierarchical Clustering Heatmap of Athlete Biometrics"
)

The heatmap displays smooth gradients across variables rather than abrupt transitions between clusters. Training load, risky movements, and recovery time show greater variability than fitness measures (e.g., VO2_Max and HR_Variability), while injury status does not align cleanly with any cluster.

5.2.2 Formal Cluster–Injury Association Test

To formally assess whether cluster membership is associated with injury history, a chi-square test was performed.

clusters <- cutree(hc, k = 3)

cluster_table <- table(clusters, athletes$Injury_History)
cluster_table
##         
## clusters Uninjured Injured
##        1       575     224
##        2       120      50
##        3        48      33
chisq.test(cluster_table)
## 
##  Pearson's Chi-squared test
## 
## data:  cluster_table
## X-squared = 5.7414, df = 2, p-value = 0.05666

The test indicates a non-significant association, confirming that hierarchical clusters do not correspond to injury status.

5.3 Model-Based Clustering

Model-based clustering was applied to determine whether probabilistic Gaussian mixture models identify latent structure more effectively and to objectively select cluster number using the Bayesian Information Criterion (BIC).

mc <- Mclust(pca_scaled)
summary(mc)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EEI (diagonal, equal volume and shape) model with 4 components: 
## 
##  log-likelihood    n df       BIC       ICL
##       -14692.89 1050 53 -29754.48 -30062.31
## 
## Clustering table:
##   1   2   3   4 
## 254 254 272 270
plot(mc, what = "BIC")

plot(mc, what = "classification")

The BIC curve shows limited improvement with increasing model complexity. The selected solution identifies multiple components of similar size with substantial overlap, and cluster assignments do not align with injury status.

5.4 Dimensionality Reduction Synthesis

Across PCA, hierarchical clustering, and model-based clustering, biometric variability does not organize into discrete injury-defined groups or low-dimensional structures. Instead, variability is distributed across multiple partially independent axes, with injury risk increasing gradually along dimensions associated with cumulative workload, insufficient recovery, and risky movement behavior. These findings justify the use of supervised, probabilistic modeling approaches in subsequent sections and underscore the inherent complexity of injury prediction in athletic populations.

6. Predictive Modeling

This section evaluates whether biometric, training, and demographic variables can predict injury history at the individual athlete level using supervised learning. Three complementary modeling paradigms were selected to reflect different assumptions about the injury-generation process: logistic regression (linear effects), Random Forest (nonlinear interactions), and Naive Bayes (probabilistic class-conditional structure).

All models were trained on a designated training set and evaluated on a held-out test set to assess generalization performance. Given the moderate class imbalance and the clinical relevance of identifying injured athletes, evaluation emphasizes discrimination and class-specific metrics rather than overall accuracy alone.

6.1 Train/Test Split and Balance Diagnostics

To enable unbiased model evaluation, the dataset was randomly partitioned into training (two-thirds) and test (one-third) subsets. Balance diagnostics were performed to confirm that the split preserved outcome prevalence and predictor distributions, thereby minimizing sampling artifacts that could confound model comparisons.

set.seed(123)

model_data <- athletes %>%
dplyr::select(Injury_History, all_of(numeric_vars), Gender, Sport, Feedback_Level)

# Create random 2/3–1/3 split


test_index <- sample(seq_len(nrow(model_data)), size = round(nrow(model_data) / 3))

train_data <- model_data[-test_index, ]
test_data  <- model_data[test_index, ]

# Add indicator for balance checks

model_data$subset <- "train"
model_data$subset[test_index] <- "test"
model_data$subset <- factor(model_data$subset)

# Dimensions

dim(train_data)
## [1] 700  14
dim(test_data)
## [1] 350  14
# Outcome balance

prop.table(table(train_data$Injury_History))
## 
## Uninjured   Injured 
## 0.6885714 0.3114286
prop.table(table(test_data$Injury_History))
## 
## Uninjured   Injured 
## 0.7457143 0.2542857
# Numeric balance checks

numeric_balance <- model_data %>% dplyr::select(where(is.numeric), subset)

numeric_balance %>%
pivot_longer(cols = -subset, names_to = "variable", values_to = "value") %>%
ggplot(aes(x = subset, y = value)) +
facet_wrap(~variable, scales = "free") +
geom_boxplot() +
theme_minimal()

numeric_pvals <- numeric_balance %>%
dplyr::select(-subset) %>%
summarise(across(everything(), ~ t.test(.x ~ model_data$subset)$p.value))

numeric_pvals
## # A tibble: 1 × 10
##     Age HR_Variability VO2_Max Speed_Index Endurance_Hours Risky_Moves
##   <dbl>          <dbl>   <dbl>       <dbl>           <dbl>       <dbl>
## 1 0.337          0.699   0.771      0.0363           0.976      0.0853
## # ℹ 4 more variables: Sleep_Hours <dbl>, Step_Count <dbl>, Training_Load <dbl>,
## #   Recovery_Time <dbl>
# Categorical balance checks

categorical_balance <- model_data %>% dplyr::select(where(is.factor), subset)

categorical_balance %>%
pivot_longer(cols = -subset, names_to = "variable", values_to = "value") %>%
ggplot(aes(x = subset, fill = value)) +
facet_wrap(~variable) +
geom_bar(position = "fill") +
theme_minimal()

chisq_outcome <- model_data %>%
dplyr::count(Injury_History, subset) %>%
pivot_wider(names_from = subset, values_from = n, values_fill = 0) %>%
column_to_rownames("Injury_History") %>%
chisq.test()

chisq_outcome$p.value
## [1] 0.06474104

Visual inspection and formal tests indicate that injury prevalence, numeric predictor distributions, and categorical variable proportions are comparable between subsets. These checks confirm that downstream performance differences can be attributed to model behavior rather than structural differences introduced during data partitioning.

6.2 Logistic Regression

6.2.1 Baseline GLM

Logistic regression was used as an interpretable baseline model, estimating injury probability as a linear combination of biometric, workload, and contextual predictors. This specification reflects a standard multivariable approach in sports injury research and provides a benchmark for evaluating more flexible models.

log_model <- glm(
Injury_History ~ Age + HR_Variability + VO2_Max + Speed_Index +
Endurance_Hours + Risky_Moves + Sleep_Hours + Step_Count +
Training_Load + Recovery_Time + Gender + Sport + Feedback_Level,
data = train_data,
family = binomial
)

summary(log_model)
## 
## Call:
## glm(formula = Injury_History ~ Age + HR_Variability + VO2_Max + 
##     Speed_Index + Endurance_Hours + Risky_Moves + Sleep_Hours + 
##     Step_Count + Training_Load + Recovery_Time + Gender + Sport + 
##     Feedback_Level, family = binomial, data = train_data)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     -3.485e+00  1.728e+00  -2.017  0.04372 * 
## Age              8.560e-02  4.228e-02   2.024  0.04293 * 
## HR_Variability   7.946e-03  8.173e-03   0.972  0.33097   
## VO2_Max         -1.091e-02  1.718e-02  -0.635  0.52531   
## Speed_Index     -2.918e-02  8.882e-02  -0.328  0.74254   
## Endurance_Hours  1.084e-01  1.036e-01   1.047  0.29530   
## Risky_Moves      6.906e-02  5.535e-02   1.248  0.21215   
## Sleep_Hours     -3.967e-02  8.429e-02  -0.471  0.63789   
## Step_Count       7.549e-05  4.070e-05   1.855  0.06363 . 
## Training_Load   -2.314e-03  1.464e-03  -1.580  0.11401   
## Recovery_Time    6.557e-03  1.632e-02   0.402  0.68778   
## GenderM          2.328e-01  1.683e-01   1.383  0.16665   
## SportBasketball  4.058e-01  2.926e-01   1.387  0.16549   
## SportSoccer      4.972e-01  3.010e-01   1.652  0.09859 . 
## SportSwimming    2.505e-01  2.933e-01   0.854  0.39303   
## SportTennis      7.987e-01  2.864e-01   2.789  0.00529 **
## SportVolleyball  5.056e-01  3.086e-01   1.638  0.10134   
## Feedback_Level2 -2.803e-01  2.623e-01  -1.069  0.28526   
## Feedback_Level3 -1.068e-01  2.638e-01  -0.405  0.68549   
## Feedback_Level4 -6.421e-02  2.578e-01  -0.249  0.80332   
## Feedback_Level5 -6.699e-02  2.600e-01  -0.258  0.79666   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 868.33  on 699  degrees of freedom
## Residual deviance: 842.97  on 679  degrees of freedom
## AIC: 884.97
## 
## Number of Fisher Scoring iterations: 4
log_prob <- predict(log_model, test_data, type = "response")
log_pred <- factor(ifelse(log_prob > 0.5, "Injured", "Uninjured"),
levels = c("Uninjured", "Injured"))

log_roc <- roc(test_data$Injury_History, log_prob)
## Setting levels: control = Uninjured, case = Injured
## Setting direction: controls > cases
auc(log_roc)
## Area under the curve: 0.5354
cm_log <- confusionMatrix(log_pred, test_data$Injury_History)
cm_log
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Uninjured Injured
##   Uninjured       254      89
##   Injured           7       0
##                                           
##                Accuracy : 0.7257          
##                  95% CI : (0.6758, 0.7718)
##     No Information Rate : 0.7457          
##     P-Value [Acc > NIR] : 0.8218          
##                                           
##                   Kappa : -0.0385         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9732          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.7405          
##          Neg Pred Value : 0.0000          
##              Prevalence : 0.7457          
##          Detection Rate : 0.7257          
##    Detection Prevalence : 0.9800          
##       Balanced Accuracy : 0.4866          
##                                           
##        'Positive' Class : Uninjured       
## 

Test-set evaluation reveals limited discrimination. Although overall accuracy is high, this reflects correct classification of the uninjured majority. ROC–AUC values are low, and sensitivity for injured athletes is poor, indicating that linear decision boundaries do not adequately capture injury risk in this dataset.

6.2.2 Stepwise AIC GLM

### Stepwise AIC Logistic Regression

log_model_aic <- stats::step(log_model, direction = "both", trace = FALSE)
summary(log_model_aic)
## 
## Call:
## glm(formula = Injury_History ~ Age + Step_Count + Training_Load, 
##     family = binomial, data = train_data)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   -2.569e+00  1.052e+00  -2.443   0.0146 *
## Age            8.088e-02  4.095e-02   1.975   0.0483 *
## Step_Count     6.957e-05  3.974e-05   1.750   0.0800 .
## Training_Load -2.497e-03  1.434e-03  -1.741   0.0817 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 868.33  on 699  degrees of freedom
## Residual deviance: 858.83  on 696  degrees of freedom
## AIC: 866.83
## 
## Number of Fisher Scoring iterations: 4
log_aic_prob <- predict(log_model_aic, test_data, type = "response")
log_aic_pred <- factor(ifelse(log_aic_prob > 0.5, "Injured", "Uninjured"),
levels = c("Uninjured", "Injured"))

log_aic_roc <- roc(test_data$Injury_History, log_aic_prob)
## Setting levels: control = Uninjured, case = Injured
## Setting direction: controls > cases
auc(log_aic_roc)
## Area under the curve: 0.5219
cm_log_aic <- confusionMatrix(log_aic_pred, test_data$Injury_History)
cm_log_aic
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Uninjured Injured
##   Uninjured       261      89
##   Injured           0       0
##                                           
##                Accuracy : 0.7457          
##                  95% CI : (0.6967, 0.7905)
##     No Information Rate : 0.7457          
##     P-Value [Acc > NIR] : 0.5285          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.7457          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.7457          
##          Detection Rate : 0.7457          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : Uninjured       
## 

Stepwise variable selection using AIC was applied to assess whether a more parsimonious linear model improves generalization. While this procedure reduces model complexity, test-set performance remains largely unchanged. Discrimination and sensitivity do not improve meaningfully, suggesting that limited predictive performance is driven by weak linear signal rather than overfitting.

Together, the logistic regression results highlight the limitations of purely linear models for injury prediction and motivate exploration of nonlinear approaches.

6.3 Random Forest

Random Forest models were applied to capture nonlinear relationships and higher-order interactions among predictors. Baseline, tuned, and reduced-feature variants were evaluated to assess robustness and sensitivity to model specification.

## 6.3.1 Baseline Random Forest Model


set.seed(123)

rf_base <- randomForest(
Injury_History ~ .,
data = train_data,
ntree = 500,
importance = TRUE
)

# Variable importance

varImpPlot(
rf_base,
main = "Variable Importance – Baseline Random Forest"
)

# Predictions

rf_base_prob <- predict(rf_base, test_data, type = "prob")[, "Injured"]
rf_base_pred <- predict(rf_base, test_data)

# ROC and AUC

rf_base_roc <- roc(test_data$Injury_History, rf_base_prob)
## Setting levels: control = Uninjured, case = Injured
## Setting direction: controls < cases
auc(rf_base_roc)
## Area under the curve: 0.5031
# Confusion matrix

cm_rf_base <- confusionMatrix(rf_base_pred, test_data$Injury_History)
cm_rf_base
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Uninjured Injured
##   Uninjured       255      84
##   Injured           6       5
##                                           
##                Accuracy : 0.7429          
##                  95% CI : (0.6937, 0.7878)
##     No Information Rate : 0.7457          
##     P-Value [Acc > NIR] : 0.5768          
##                                           
##                   Kappa : 0.0467          
##                                           
##  Mcnemar's Test P-Value : 4.798e-16       
##                                           
##             Sensitivity : 0.97701         
##             Specificity : 0.05618         
##          Pos Pred Value : 0.75221         
##          Neg Pred Value : 0.45455         
##              Prevalence : 0.74571         
##          Detection Rate : 0.72857         
##    Detection Prevalence : 0.96857         
##       Balanced Accuracy : 0.51660         
##                                           
##        'Positive' Class : Uninjured       
## 

Hyperparameter tuning explored different values of mtry and node size.

## 6.3.2 Random Forest Hyperparameter Optimization


set.seed(682)

rf_1 <- randomForest(Injury_History ~ ., data = train_data,
mtry = 2, nodesize = 5, importance = TRUE)

rf_2 <- randomForest(Injury_History ~ ., data = train_data,
mtry = 2, nodesize = 10, importance = TRUE)

rf_3 <- randomForest(Injury_History ~ ., data = train_data,
mtry = 4, nodesize = 5, importance = TRUE)

rf_4 <- randomForest(Injury_History ~ ., data = train_data,
mtry = 4, nodesize = 10, importance = TRUE)

# Compare training confusion matrices

rf_1$confusion
##           Uninjured Injured class.error
## Uninjured       473       9   0.0186722
## Injured         216       2   0.9908257
rf_2$confusion
##           Uninjured Injured class.error
## Uninjured       476       6  0.01244813
## Injured         216       2  0.99082569
rf_3$confusion
##           Uninjured Injured class.error
## Uninjured       462      20  0.04149378
## Injured         213       5  0.97706422
rf_4$confusion
##           Uninjured Injured class.error
## Uninjured       469      13  0.02697095
## Injured         212       6  0.97247706
# Select best-performing model (example: rf_4)

rf_best <- rf_4

# Variable importance

varImpPlot(
rf_best,
main = "Variable Importance – Optimized Random Forest"
)

# Predictions

rf_best_prob <- predict(rf_best, test_data, type = "prob")[, "Injured"]
rf_best_pred <- predict(rf_best, test_data)

# ROC and AUC

rf_best_roc <- roc(test_data$Injury_History, rf_best_prob)
## Setting levels: control = Uninjured, case = Injured
## Setting direction: controls > cases
auc(rf_best_roc)
## Area under the curve: 0.4925
# Confusion matrix

cm_rf_best <- confusionMatrix(rf_best_pred, test_data$Injury_History)
cm_rf_best
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Uninjured Injured
##   Uninjured       253      85
##   Injured           8       4
##                                           
##                Accuracy : 0.7343          
##                  95% CI : (0.6847, 0.7798)
##     No Information Rate : 0.7457          
##     P-Value [Acc > NIR] : 0.712           
##                                           
##                   Kappa : 0.02            
##                                           
##  Mcnemar's Test P-Value : 3.252e-15       
##                                           
##             Sensitivity : 0.96935         
##             Specificity : 0.04494         
##          Pos Pred Value : 0.74852         
##          Neg Pred Value : 0.33333         
##              Prevalence : 0.74571         
##          Detection Rate : 0.72286         
##    Detection Prevalence : 0.96571         
##       Balanced Accuracy : 0.50715         
##                                           
##        'Positive' Class : Uninjured       
## 

A reduced-feature model excluding VO₂ max was also evaluated.

## 6.3.3 Random Forest Variable Reduction (Remove VO2_Max)


set.seed(123)

rf_no_vo2 <- randomForest(
Injury_History ~ . - VO2_Max,
data = train_data,
ntree = 500,
importance = TRUE
)

# Variable importance

varImpPlot(
rf_no_vo2,
main = "Variable Importance – Random Forest Without VO2_Max"
)

# Predictions

rf_no_vo2_prob <- predict(rf_no_vo2, test_data, type = "prob")[, "Injured"]
rf_no_vo2_pred <- predict(rf_no_vo2, test_data)

# ROC and AUC

rf_no_vo2_roc <- roc(test_data$Injury_History, rf_no_vo2_prob)
## Setting levels: control = Uninjured, case = Injured
## Setting direction: controls > cases
auc(rf_no_vo2_roc)
## Area under the curve: 0.5024
# Confusion matrix

cm_rf_no_vo2 <- confusionMatrix(rf_no_vo2_pred, test_data$Injury_History)
cm_rf_no_vo2
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Uninjured Injured
##   Uninjured       253      84
##   Injured           8       5
##                                           
##                Accuracy : 0.7371          
##                  95% CI : (0.6877, 0.7825)
##     No Information Rate : 0.7457          
##     P-Value [Acc > NIR] : 0.6692          
##                                           
##                   Kappa : 0.0355          
##                                           
##  Mcnemar's Test P-Value : 5.312e-15       
##                                           
##             Sensitivity : 0.96935         
##             Specificity : 0.05618         
##          Pos Pred Value : 0.75074         
##          Neg Pred Value : 0.38462         
##              Prevalence : 0.74571         
##          Detection Rate : 0.72286         
##    Detection Prevalence : 0.96286         
##       Balanced Accuracy : 0.51276         
##                                           
##        'Positive' Class : Uninjured       
## 

Across all Random Forest implementations, variable importance rankings consistently emphasize training load, risky movements, and recovery time, aligning with earlier exploratory findings. However, test-set performance remains modest. While specificity is high, sensitivity for injured athletes remains low, and ROC–AUC values indicate weak discrimination.

Hyperparameter tuning and feature removal improve fit to the majority class but do not substantially enhance identification of injured athletes. These results suggest that model flexibility alone cannot overcome the substantial overlap between injured and uninjured biometric profiles.

6.4 Naive Bayes

Naive Bayes was included as a probabilistic classifier that explicitly models class-conditional predictor distributions. After centering and scaling numeric predictors, the model achieves the highest AUC among the tested approaches, although the improvement is modest and only marginally exceeds chance-level discrimination.

## 6.4 Naive Bayes (Scaled Numeric Predictors)

# Identify numeric predictors (exclude outcome)
nb_numeric_vars <- numeric_vars

# Fit scaler on training data only
scaler <- preProcess(
  train_data[, nb_numeric_vars],
  method = c("center", "scale")
)

# Apply scaling
train_nb_scaled <- train_data
test_nb_scaled  <- test_data

train_nb_scaled[, nb_numeric_vars] <- predict(scaler, train_data[, nb_numeric_vars])
test_nb_scaled[, nb_numeric_vars]  <- predict(scaler, test_data[, nb_numeric_vars])

# Fit Naive Bayes model
nb_model_scaled <- naive_bayes(
  Injury_History ~ .,
  data = train_nb_scaled
)

# Predictions
nb_prob_scaled <- predict(nb_model_scaled, test_nb_scaled, type = "prob")[, "Injured"]
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
nb_pred_scaled <- predict(nb_model_scaled, test_nb_scaled)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
# ROC and AUC
nb_roc_scaled <- roc(test_nb_scaled$Injury_History, nb_prob_scaled)
## Setting levels: control = Uninjured, case = Injured
## Setting direction: controls > cases
auc(nb_roc_scaled)
## Area under the curve: 0.5483
# Confusion matrix
cm_nb_scaled <- confusionMatrix(nb_pred_scaled, test_nb_scaled$Injury_History)
cm_nb_scaled
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Uninjured Injured
##   Uninjured       250      85
##   Injured          11       4
##                                           
##                Accuracy : 0.7257          
##                  95% CI : (0.6758, 0.7718)
##     No Information Rate : 0.7457          
##     P-Value [Acc > NIR] : 0.8218          
##                                           
##                   Kappa : 0.0039          
##                                           
##  Mcnemar's Test P-Value : 9.297e-14       
##                                           
##             Sensitivity : 0.95785         
##             Specificity : 0.04494         
##          Pos Pred Value : 0.74627         
##          Neg Pred Value : 0.26667         
##              Prevalence : 0.74571         
##          Detection Rate : 0.71429         
##    Detection Prevalence : 0.95714         
##       Balanced Accuracy : 0.50140         
##                                           
##        'Positive' Class : Uninjured       
## 

Sensitivity remains limited, likely reflecting violations of the conditional independence assumption and extensive overlap between class-conditional distributions. As with other models, performance appears constrained by data structure rather than modeling choice.

6.5 Model Comparison Table

Performance metrics across all models are summarized in a unified comparison table and ROC visualization. While overall accuracy is consistently high, this metric is driven by correct classification of uninjured athletes. Sensitivity for injured athletes remains low across all approaches, and ROC–AUC values cluster near 0.5–0.6.

## 6.5.1 Model Comparison Table

model_comparison <- tibble(
  Model = c(
    "Logistic",
    "Logistic (AIC)",
    "Random Forest (Baseline)",
    "Random Forest (Optimized)",
    "Random Forest (No VO2)",
    "Naive Bayes (Scaled)"
  ),

  Accuracy = c(
    cm_log$overall["Accuracy"],
    cm_log_aic$overall["Accuracy"],
    cm_rf_base$overall["Accuracy"],
    cm_rf_best$overall["Accuracy"],
    cm_rf_no_vo2$overall["Accuracy"],
    cm_nb_scaled$overall["Accuracy"]
  ),

  Misclassification = 1 - Accuracy,

  Deviance = c(
    deviance(log_model),
    deviance(log_model_aic),
    NA, NA, NA, NA
  ),

  Sensitivity = c(
    cm_log$byClass["Sensitivity"],
    cm_log_aic$byClass["Sensitivity"],
    cm_rf_base$byClass["Sensitivity"],
    cm_rf_best$byClass["Sensitivity"],
    cm_rf_no_vo2$byClass["Sensitivity"],
    cm_nb_scaled$byClass["Sensitivity"]
  ),

  Specificity = c(
    cm_log$byClass["Specificity"],
    cm_log_aic$byClass["Specificity"],
    cm_rf_base$byClass["Specificity"],
    cm_rf_best$byClass["Specificity"],
    cm_rf_no_vo2$byClass["Specificity"],
    cm_nb_scaled$byClass["Specificity"]
  ),

  AUC = c(
    auc(log_roc),
    auc(log_aic_roc),
    auc(rf_base_roc),
    auc(rf_best_roc),
    auc(rf_no_vo2_roc),
    auc(nb_roc_scaled)
  )
)

model_comparison
## # A tibble: 6 × 7
##   Model        Accuracy Misclassification Deviance Sensitivity Specificity   AUC
##   <chr>           <dbl>             <dbl>    <dbl>       <dbl>       <dbl> <dbl>
## 1 Logistic        0.726             0.274     843.       0.973      0      0.535
## 2 Logistic (A…    0.746             0.254     859.       1          0      0.522
## 3 Random Fore…    0.743             0.257      NA        0.977      0.0562 0.503
## 4 Random Fore…    0.734             0.266      NA        0.969      0.0449 0.493
## 5 Random Fore…    0.737             0.263      NA        0.969      0.0562 0.502
## 6 Naive Bayes…    0.726             0.274      NA        0.958      0.0449 0.548
## ROC Curve Comparison Across Models

# Extract AUC values
auc_vals <- c(
  Logistic        = as.numeric(auc(log_roc)),
  Logistic_AIC    = as.numeric(auc(log_aic_roc)),
  RF_Baseline     = as.numeric(auc(rf_base_roc)),
  RF_Optimized    = as.numeric(auc(rf_best_roc)),
  RF_No_VO2       = as.numeric(auc(rf_no_vo2_roc)),
  Naive_Bayes     = as.numeric(auc(nb_roc_scaled))
)

# Base ROC
plot(
  log_roc,
  col = "black",
  lwd = 2,
  main = "ROC Curve Comparison Across Models"
)

# Add remaining models
lines(log_aic_roc, col = "blue", lwd = 2)
lines(rf_base_roc, col = "darkgreen", lwd = 2)
lines(rf_best_roc, col = "red", lwd = 2)
lines(rf_no_vo2_roc, col = "purple", lwd = 2)
lines(nb_roc_scaled, col = "orange", lwd = 2)

# Reference line
abline(0, 1, lty = 2, col = "gray")

# Legend
legend(
  "bottomright",
  legend = c(
    paste0("Logistic (AUC = ", round(auc_vals["Logistic"], 3), ")"),
    paste0("Logistic AIC (AUC = ", round(auc_vals["Logistic_AIC"], 3), ")"),
    paste0("RF Baseline (AUC = ", round(auc_vals["RF_Baseline"], 3), ")"),
    paste0("RF Optimized (AUC = ", round(auc_vals["RF_Optimized"], 3), ")"),
    paste0("RF No VO2 (AUC = ", round(auc_vals["RF_No_VO2"], 3), ")"),
    paste0("Naive Bayes (AUC = ", round(auc_vals["Naive_Bayes"], 3), ")")
  ),
  col = c("black", "blue", "darkgreen", "red", "purple", "orange"),
  lwd = 2,
  bty = "n",
  cex = 0.85
)

# Save trained Naive Bayes model
save(nb_model_scaled, file = "nb_model_scaled.RData")

# Save scaler
save(scaler, file = "scaler.RData")

# Save numeric variable list
save(nb_numeric_vars, file = "nb_numeric_vars.RData")

Although Random Forest and Naive Bayes marginally outperform logistic regression, no model demonstrates strong or clinically actionable predictive capability. Collectively, these findings reinforce the conclusion that injury prediction using cross-sectional biometric data is intrinsically difficult and that improved performance will likely require richer longitudinal, biomechanical, or contextual information rather than alternative classifiers alone.

7. Discussion and Conclusion

The objective of this project was to evaluate whether routinely collected biometric, behavioral, and training-related variables could be used to characterize and predict sports-related injury risk in a collegiate athlete population. Using a comprehensive analytical pipeline that integrated exploratory data analysis, dimensionality reduction, unsupervised clustering, and multiple supervised learning approaches, this study provides a balanced assessment of both the potential and the limitations of data-driven injury modeling.

Across all analytical stages, a consistent finding emerged: injury risk does not appear as a simple or separable pattern within the available data. Exploratory analyses revealed substantial overlap between injured and uninjured athletes across individual variables and multivariate relationships. Although injured athletes tended to exhibit higher training loads, greater engagement in risky movements, and reduced recovery time, these differences were gradual rather than categorical. No single variable, nor any low-dimensional combination of variables, offered strong discriminatory power. This pattern aligns with contemporary injury science, which views injury as an emergent outcome arising from cumulative stress, limited adaptation, and interacting physiological processes rather than isolated risk factors.

Dimensionality reduction and unsupervised learning analyses reinforced this interpretation. Principal Component Analysis revealed a high-dimensional variance structure, with no dominant latent axis summarizing injury-related variability. The absence of a clear elbow in the scree plot and the extensive overlap observed in PCA biplots indicate that biometric variation reflects multiple partially independent domains, including workload, fitness, recovery, and behavior. Injury status did not align cleanly with these axes, suggesting that risk is not encoded along a single physiological dimension.

Similarly, hierarchical and model-based clustering failed to identify discrete athlete subgroups corresponding to injury history. Cluster structures exhibited smooth gradients rather than sharp boundaries, and formal statistical tests showed weak or non-significant associations between cluster membership and injury status. These findings argue against stable “injury-prone” biometric phenotypes and instead support a continuum-based view of injury risk, in which athletes move dynamically between lower- and higher-risk states as training demands and recovery balance fluctuate.

Supervised predictive modeling highlighted the practical implications of this underlying data structure. Logistic regression, Random Forest, and Naive Bayes classifiers all demonstrated limited discrimination, with ROC–AUC values near chance and consistently low sensitivity for injured athletes. While overall accuracy appeared high, it was driven largely by correct classification of the uninjured majority and therefore overstated predictive performance. Across models, injured athletes remained difficult to identify using cross-sectional biometric data alone.

The modest gains achieved by more flexible models such as Random Forest and Naive Bayes indicate that performance limitations are not primarily methodological. Even when nonlinear interactions and probabilistic structure were modeled, substantial overlap between injured and uninjured predictor distributions constrained achievable performance. Variable importance analyses consistently highlighted training load, risky movements, and recovery-related measures, reinforcing their relevance for athlete monitoring. However, these variables contribute probabilistically rather than deterministically, increasing risk without defining clear decision thresholds.

These results underscore a central principle in applied machine learning: predictive performance is fundamentally limited by data structure and content. While the variables included capture meaningful aspects of physiological load and behavior, they represent only a subset of relevant injury determinants. Unmeasured factors such as biomechanical asymmetries, tissue-specific loading, psychological stress, prior injury severity, and contextual training decisions likely play important roles. Moreover, injury itself is not binary. Differences in injury type, severity, duration, and functional impact are not captured by the outcome variable used here.

The cross-sectional nature of the dataset further limits inference. Injury history is treated as a static label without information on timing relative to data collection, making it unclear whether observed biometrics reflect pre-injury risk, post-injury adaptation, or compensatory behavior. Without longitudinal tracking, acute-to-chronic workload transitions, delayed effects, and temporal risk evolution cannot be modeled, likely attenuating predictive signal.

Despite these limitations, this project makes several meaningful contributions. It demonstrates a rigorous and transparent analytical workflow, provides empirical support for conceptualizing injury risk as a continuous and multifactorial process, and consistently identifies workload, recovery, and behavioral measures as relevant monitoring signals, even when deterministic prediction is not achievable.

The interactive application developed as part of this work further enhances its translational value by enabling dynamic exploration of variable relationships and probabilistic risk estimates for hypothetical athlete profiles. While not intended as a diagnostic tool, it offers a structured framework for hypothesis generation and decision support. In applied settings, such tools are most effective when used to augment expert judgment rather than replace it.

In conclusion, this study demonstrates that predicting sports-related injuries using cross-sectional biometric data remains inherently challenging. The absence of strong predictive performance reflects the biological and contextual complexity of injury mechanisms rather than analytical shortcomings. These findings highlight the need for richer, longitudinal, and multidimensional data sources and for probabilistic risk frameworks that explicitly acknowledge uncertainty. By critically evaluating both the capabilities and limitations of biometric injury modeling, this work contributes to a more realistic and scientifically grounded understanding of how data-driven approaches can support injury prevention in collegiate athletics.

8. Appendix

8.1 Variable Dictionary

tibble(
Variable = c(numeric_vars, "Gender", "Sport", "Feedback_Level", "Injury_History"),
Description = c(
"Age in years",
"Heart rate variability",
"VO2 max",
"Speed performance index",
"Weekly endurance hours",
"Count of risky movements",
"Average sleep hours",
"Daily step count",
"Composite training load",
"Recovery time post training",
"Gender",
"Sport",
"Subjective feedback level",
"Injury status"
)
)
## # A tibble: 14 × 2
##    Variable        Description                
##    <chr>           <chr>                      
##  1 Age             Age in years               
##  2 HR_Variability  Heart rate variability     
##  3 VO2_Max         VO2 max                    
##  4 Speed_Index     Speed performance index    
##  5 Endurance_Hours Weekly endurance hours     
##  6 Risky_Moves     Count of risky movements   
##  7 Sleep_Hours     Average sleep hours        
##  8 Step_Count      Daily step count           
##  9 Training_Load   Composite training load    
## 10 Recovery_Time   Recovery time post training
## 11 Gender          Gender                     
## 12 Sport           Sport                      
## 13 Feedback_Level  Subjective feedback level  
## 14 Injury_History  Injury status

9. Shiny App: : Interactive Exploration of Injury Risk Patterns

The PSL4040 Athlete Injury Risk Explorer is an interactive R Shiny application designed to serve as a translational tool for exploring collegiate athlete data and calculating the probabilistic risk of sports-related injury.

It bridges formal statistical analysis with applied interpretation, focusing on exploration and explanation rather than serving as a definitive predictive decision tool. This design reflects the project’s finding that injury risk is a complex, multifactorial, and continuous process that resists simple, sharp classification.

Application Components and Analytical Value The application is organized into six tabs, offering a blend of data visualization, utility, and predictive modeling:

  1. Core Purpose and Predictive Modeling Goal: To support data exploration and calculate the probabilistic risk of injury using athlete-level variables.

Underlying Model: It uses a pre-trained Naive Bayes Machine Learning Model (nb_model_scaled) developed in a prior analysis.

Data Inputs (Key Variables):

Biometrics/Physiological: Age, HR Variability, VO2 Max, Speed Index.

Workload/Behavioral: Endurance Hours, Risky Moves, Sleep Hours, Step Count, Training Load, Recovery Time.

Categorical: Gender, Sport, Feedback Level.

Disclaimer: The app consistently reinforces that all output represents probabilistic risk, not deterministic outcomes or a diagnosis.

  1. Interactive Prediction (Injury Risk Prediction Tab) This feature allows users to apply the trained model to new data, promoting analytical transparency:

Manual Athlete Entry: Users input specific values for a single athlete’s 13 variables. The app processes this input and displays the single resulting Probability of Injury in a minimal table format.

CSV File Upload: Users can upload a dataset of multiple new athletes. The app calculates the injury probability for every athlete and presents the full results in a single, interactive, searchable, and pageable table.

  1. Exploratory Data Analysis (Scatter Plot, Box Plot, Categorical Distributions Tabs) These tabs are the primary means of understanding the complexity and overlap in the data, consistent with the principle of Interpretability:

Scatter Plot Explorer: Dynamically plots any two numeric variables, coloring points by a categorical group (e.g., Injury History) to assess bivariate relationships, overlap, and separation (or lack thereof).

Box Plot Explorer: Compares the distribution of a single numeric variable across different categorical groups.

Categorical Distributions: Visualizes the frequency counts and relationships between categorical variables.

Analytical Value: By allowing users to explore different variable combinations, the app makes it immediately clear that injury status does not correspond to clean thresholds in biometric variables, and overlap between injured and uninjured athletes is pervasive.

  1. Transparency and Utility (Data Table and Glossary Tabs) These components support the guiding principle of Transparency by providing direct access to the underlying information:

Data Table: An interactive view of the raw training dataset, allowing users to sort, search, and inspect individual athlete profiles to cross-reference plotted patterns with raw values.

Glossary: Provides a detailed definition for every variable used in the application (e.g., HR Variability, Training Load).