Sports-related injuries represent a critical and costly challenge in collegiate athletics, compromising athlete health, diminishing team performance, and impacting long-term career sustainability. Moving from reactive treatment to proactive prevention requires accurately identifying which athletes are at the highest risk before an incident occurs.
While advances in wearable sensors and athlete monitoring systems have enabled the continuous quantification of physiological and behavioral metrics, translating this data into actionable prevention strategies remains complex. Traditional monitoring often relies on isolated metrics or linear associations, such as a single workload threshold. However, contemporary research suggests these methods are insufficient for capturing the full “web of determinants” that leads to injury, where risk emerges from non-linear, multivariate interactions between training load, recovery status, and physiological capacity. This necessitates a systematic, data-driven approach.
This project focuses on identifying complex patterns within a collegiate athlete dataset by analyzing variables grouped into three critical categories. The primary challenge for any predictive model, including the one developed here, is to detect these subtle, compounding risk patterns:
Workload variables are central to injury prediction, but their correlation is paradoxical:
Training Load, Endurance Hours, and Step Count: The relationship is non-linear, governed by the Training-Injury Prevention Paradox. The main risk factor is not high training volume itself, but the rate of load change. High chronic workloads (consistent, long-term high training) are often protective, acting like a “vaccine” that builds physical resilience. Conversely, rapid, acute spikes in load compared to the athlete’s chronic average are the main trigger for soft-tissue injuries, as the body cannot adapt quickly enough.
Risky Moves: This variable is expected to show a positive correlation with risk. A higher count suggests repeated exposure to compromised movement quality or high biomechanical stress, leading to cumulative microtrauma and structural failure.
These variables measure the body’s capacity to handle demands and its internal stress response:
Heart Rate Variability (HRV): This generally shows a negative correlation (lower HRV = higher risk). Low HRV indicates the autonomic nervous system is dominated by the sympathetic (stress) branch, signaling high accumulated physiological stress, poor recovery, or maladaptation to training.
VO2 Max: This is expected to show a negative correlation (higher VO2 Max = lower risk). Higher cardiorespiratory fitness typically implies greater physical resilience and better capacity to manage high training loads without incurring excessive strain or fatigue-related movement errors.
Age: The correlation is often non-linear in the athletic population, with risk potentially elevated at the extremes of the collegiate range due to ongoing development or reduced tissue elasticity.
These variables dictate the body’s ability to repair and adapt, and are critical for mitigating risk:
Sleep Hours & Recovery Time: Both show a negative correlation (lower values = higher risk). Adequate sleep and recovery time are vital for physiological repair and neurocognitive recovery. Chronic deficits impair motor control, reduce reaction time, and compound residual fatigue, directly increasing the risk of sustaining a fatigue-related injury.
Feedback Level: The correlation is complex and indirect. As a proxy for the psychological environment, very high or very low feedback may point to an overly stressful or neglectful environment, which contributes to psychological stress—a known indirect risk factor for injury.
This assignment is designed to systematically address the challenges presented by this complex data through a two-fold approach:
Systematic Data Exploration: The primary goal is to systematically explore the provided collegiate athlete biometric dataset to identify meaningful, albeit often subtle, relationships between the key variables and the ultimate outcome: injury history. This initial exploration is crucial for understanding the data separation and overlap—or lack thereof—between injured and uninjured groups, which informs the expectations for subsequent predictive modeling.
Predictive Modeling and Application Development: The second goal is to leverage a Naive Bayes classification model to quantify the probabilistic risk of injury based on the input features. This model serves as the computational core for a fully functional, interactive Shiny application. This application provides external users (coaches, athletes, or researchers) a platform for intuitive data exploration and the practical application of the machine learning results, allowing them to input new data and receive a probabilistic injury risk score.
By employing the Naive Bayes model, this project attempts to detect the required nuanced multivariate patterns (e.g., flagging an athlete who exhibits low HRV AND low sleep hours AND a high acute training load) that traditional univariate analysis often misses. However, the successful application of machine learning in sports injury prediction is often constrained by real-world data limitations: datasets are typically small, noisy, and contain significant overlap between outcome classes. Therefore, while ML provides a powerful framework, the derived probabilities must be approached with caution. The model and its predictions are intended as a proof-of-concept and a platform for further analysis, not a definitive diagnostic tool. The predictive strength is limited by the current dataset, underscoring that the app’s greatest utility lies in its framework and ability to facilitate intuitive data exploration.
To establish the analytical framework, several R packages were loaded to support data manipulation, visualization, modeling, and evaluation. The tidyverse suite was used for data import, cleaning, and manipulation, as well as for foundational visualization. Exploratory data analysis was supported by ggplot2, GGally, pheatmap, and corrplot, enabling visualization of distributions, pairwise relationships, heatmaps, and correlation structures.
Supervised machine learning was implemented using caret for model training and evaluation workflows, randomForest for ensemble-based classification, and naivebayes for probabilistic modeling. Model performance was assessed using receiver operating characteristic (ROC) analysis via the pROC package.
Unsupervised learning was supported by the cluster, mclust, and factoextra packages, which enabled hierarchical, distance-based, and model-based clustering as well as visualization of principal component and clustering results. The rsample package was used for data splitting and resampling procedures, and broom was used to tidy and organize model outputs for downstream analysis and reporting.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(pheatmap)
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
library(GGally)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(rsample)
##
## Attaching package: 'rsample'
##
## The following object is masked from 'package:caret':
##
## calibration
library(cluster)
library(mclust)
## Package 'mclust' version 6.1.2
## Type 'citation("mclust")' for citing this R package in publications.
##
## Attaching package: 'mclust'
##
## The following object is masked from 'package:dplyr':
##
## count
##
## The following object is masked from 'package:purrr':
##
## map
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(naivebayes)
## naivebayes 1.0.0 loaded
## For more information please visit:
## https://majkamichal.github.io/naivebayes/
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(broom)
athletes <- read_csv("College_Sports_Dataset.csv")
## Rows: 1050 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Athlete_ID, Gender, Sport
## dbl (12): Age, HR_Variability, VO2_Max, Speed_Index, Endurance_Hours, Risky_...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Required R packages were loaded to support data manipulation, visualization, and modeling. The read_csv() function imported the dataset containing 1,050 observations and 15 variables. The initial column specification confirmed that the data includes a mix of character types (e.g., Athlete_ID, Gender) and double-precision numeric types (e.g., HR_Variability, Training_Load), indicating that the raw data structure was successfully preserved during import.
Prior to exploratory analysis and predictive modeling, the dataset must be structured to reflect the measurement scale and analytical role of each variable. Inconsistent encoding or inappropriate data types can violate model assumptions, bias results, or produce silent computational errors. This section outlines the preprocessing steps used to prepare the athlete biometric dataset for downstream statistical analysis and machine learning by enforcing correct categorical encoding, numeric representation, and structural integrity.
Categorical variables were explicitly converted to factors to ensure appropriate handling by classification algorithms. Using mutate() from the dplyr package, Gender, Sport, and Feedback_Level were recoded as factors. The outcome variable, Injury_History, was converted from a binary numeric indicator (0/1) to a labeled factor with levels “Uninjured” and “Injured”, ensuring that modeling functions treat the outcome as a categorical class rather than a numeric quantity.
athletes <- athletes %>%
mutate(
Gender = factor(Gender),
Sport = factor(Sport),
Feedback_Level = factor(Feedback_Level),
Injury_History = factor(
Injury_History,
levels = c(0, 1),
labels = c("Uninjured", "Injured")
)
)
This encoding standardizes the treatment of categorical predictors and establishes Injury_History as a discrete classification outcome across all supervised learning models used in this study.
Continuous predictors were explicitly identified and coerced to numeric format to ensure compatibility with analyses that assume numeric input, including correlation analysis, principal component analysis (PCA), clustering, and distance-based methods. A centralized vector containing the names of biometric, behavioral, and workload-related variables was defined to control which predictors were included in these analyses.
numeric_vars <- c(
"Age", "HR_Variability", "VO2_Max", "Speed_Index",
"Endurance_Hours", "Risky_Moves", "Sleep_Hours",
"Step_Count", "Training_Load", "Recovery_Time"
)
athletes[numeric_vars] <- lapply(athletes[numeric_vars], as.numeric)
Explicitly enforcing numeric data types prevents unintended coercion during matrix operations and model fitting, while the use of a centralized variable list improves reproducibility and consistency across subsequent analytical workflows.
Following preprocessing, structural validation was performed to confirm that the dataset conforms to analytical expectations and that no unintended transformations occurred during recoding. Dataset structure and variable types were inspected using glimpse(), and the distribution of the injury outcome was summarized using frequency and proportional counts.
glimpse(athletes)
## Rows: 1,050
## Columns: 15
## $ Athlete_ID <chr> "A0001", "A0002", "A0003", "A0004", "A0005", "A0006", …
## $ Age <dbl> 24, 21, 22, 24, 20, 22, 22, 24, 19, 20, 24, 20, 20, 22…
## $ Gender <fct> M, F, M, M, M, F, F, F, F, M, F, F, M, M, M, F, M, M, …
## $ Sport <fct> Tennis, Volleyball, Swimming, Basketball, Basketball, …
## $ HR_Variability <dbl> 51.39, 51.20, 69.62, 59.47, 52.19, 56.29, 49.89, 59.33…
## $ VO2_Max <dbl> 38.07, 52.25, 40.50, 41.16, 48.19, 43.57, 46.54, 34.22…
## $ Speed_Index <dbl> 6.71, 5.47, 5.61, 7.88, 6.14, 5.95, 5.16, 7.35, 5.44, …
## $ Endurance_Hours <dbl> 4.18, 3.19, 4.77, 4.29, 3.09, 2.72, 3.22, 4.67, 3.76, …
## $ Risky_Moves <dbl> 1, 3, 1, 4, 2, 3, 2, 2, 1, 1, 1, 4, 1, 2, 1, 2, 3, 2, …
## $ Sleep_Hours <dbl> 7.8, 6.1, 6.4, 7.9, 6.3, 6.8, 6.0, 8.8, 7.5, 6.7, 6.6,…
## $ Step_Count <dbl> 14229, 9239, 14566, 11391, 12213, 11997, 11006, 11039,…
## $ Injury_History <fct> Uninjured, Injured, Uninjured, Uninjured, Injured, Inj…
## $ Training_Load <dbl> 392, 361, 250, 202, 327, 318, 257, 324, 399, 312, 254,…
## $ Recovery_Time <dbl> 30.1, 21.0, 24.3, 25.9, 16.6, 22.6, 30.6, 29.4, 22.5, …
## $ Feedback_Level <fct> 5, 4, 1, 3, 2, 4, 4, 5, 4, 5, 5, 3, 1, 4, 1, 1, 1, 5, …
injury_tab <- table(athletes$Injury_History)
injury_tab
##
## Uninjured Injured
## 743 307
prop.table(injury_tab)
##
## Uninjured Injured
## 0.707619 0.292381
The inspection confirms that the dataset contains 1,050 collegiate athletes across 15 variables, with categorical fields correctly encoded as factors and biometric and workload-related variables stored as numeric values. The injury outcome distribution includes 743 uninjured athletes (70.8%) and 307 injured athletes (29.2%), indicating a moderate class imbalance.
This imbalance has direct implications for model evaluation, as accuracy-based metrics may be biased toward the majority class. Consequently, subsequent analyses emphasize performance measures that better capture discrimination and minority-class performance, including sensitivity, specificity, balanced accuracy, and ROC–AUC.
Overall, these preprocessing and validation steps establish a consistent and reliable analytical foundation for the exploratory and predictive analyses that follow.
Exploratory data analysis (EDA) was conducted to examine the structure, distributions, and relationships within the collegiate athlete biometric dataset prior to formal modeling. The primary goals were to verify data quality and physiological plausibility, and to assess whether injury status exhibits discernible univariate or low-dimensional patterns that could support straightforward classification. Given the multifactorial nature of sports-related injuries, EDA also serves to determine whether more complex multivariate or nonlinear approaches are warranted.
Summary statistics were generated to obtain a high-level overview of variable distributions and to identify potential data quality issues, including implausible values or preprocessing errors. Numeric variables were summarized using standard descriptive measures, while categorical variables were summarized by frequency.
summary(athletes[numeric_vars])
## Age HR_Variability VO2_Max Speed_Index
## Min. :18.00 Min. :30.98 Min. :28.80 Min. : 3.560
## 1st Qu.:19.00 1st Qu.:53.88 1st Qu.:41.65 1st Qu.: 5.870
## Median :21.00 Median :60.84 Median :45.02 Median : 6.500
## Mean :20.95 Mean :60.61 Mean :44.97 Mean : 6.519
## 3rd Qu.:23.00 3rd Qu.:67.46 3rd Qu.:48.36 3rd Qu.: 7.168
## Max. :24.00 Max. :92.60 Max. :60.61 Max. :10.160
## Endurance_Hours Risky_Moves Sleep_Hours Step_Count
## Min. :1.030 Min. :0.00 Min. : 3.800 Min. : 8009
## 1st Qu.:3.430 1st Qu.:1.00 1st Qu.: 6.200 1st Qu.: 9844
## Median :4.020 Median :2.00 Median : 7.000 Median :11587
## Mean :3.992 Mean :1.99 Mean : 6.952 Mean :11585
## 3rd Qu.:4.550 3rd Qu.:3.00 3rd Qu.: 7.600 3rd Qu.:13446
## Max. :6.270 Max. :7.00 Max. :10.200 Max. :14989
## Training_Load Recovery_Time
## Min. :200.0 Min. : 9.90
## 1st Qu.:249.0 1st Qu.:20.70
## Median :300.5 Median :24.00
## Mean :300.2 Mean :24.09
## 3rd Qu.:351.8 3rd Qu.:27.50
## Max. :399.0 Max. :41.10
summary(dplyr::select(athletes, Gender, Sport, Injury_History, Feedback_Level))
## Gender Sport Injury_History Feedback_Level
## F:499 Athletics :191 Uninjured:743 1:237
## M:551 Basketball:182 Injured :307 2:217
## Soccer :166 3:197
## Swimming :188 4:209
## Tennis :172 5:190
## Volleyball:151
The numeric summaries indicate physiologically realistic and internally consistent ranges across biometric and workload-related variables. Measures such as VO₂ max, sleep duration, training load, and step count fall within expected ranges for collegiate athletes, and no extreme anomalies are observed. Categorical summaries confirm representation across genders and sports, with injury history reflecting a higher proportion of uninjured athletes, consistent with real-world athletic populations.
Linear associations among numeric predictors were examined to assess shared structure and potential multicollinearity. A Pearson correlation matrix was computed and visualized to facilitate identification of correlated variable groups.
corr_mat <- cor(athletes[numeric_vars], use = "pairwise.complete.obs")
corrplot(
corr_mat,
method = "color",
type = "upper",
tl.cex = 0.8,
addCoef.col = NA
)
The correlation matrix reveals moderate positive associations among workload-related variables such as training load, step count, and endurance hours, as well as among aerobic fitness indicators including VO₂ max and speed index. No correlations approach thresholds typically associated with problematic multicollinearity, indicating that predictors are related but not redundant and can be jointly considered in multivariable analyses.
Pairwise relationships among key predictors were examined using matrix-style visualization to assess whether injury status is associated with separable patterns across variable combinations.
ggally_vars <- athletes %>%
dplyr::select(
HR_Variability, VO2_Max, Speed_Index,
Training_Load, Recovery_Time, Sleep_Hours,
Risky_Moves, Injury_History
)
ggpairs(
ggally_vars,
aes(color = Injury_History, alpha = 0.5)
) + theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Across most variable pairs, injured and uninjured athletes exhibit substantial overlap, indicating limited separability based on pairwise relationships alone. Correlation patterns observed here are consistent with those identified in the correlation matrix. While injured athletes appear slightly more concentrated at higher training loads and risky movement counts, particularly in the context of shorter recovery times, these trends are subtle and do not define clear boundaries between groups.
To examine group-level distributional differences, density plots were generated for selected workload, recovery, and physiological variables.
density_vars <- c(
"Training_Load", "Sleep_Hours",
"HR_Variability", "Recovery_Time",
"VO2_Max", "Risky_Moves"
)
for (v in density_vars) {
print(
ggplot(athletes, aes(x = .data[[v]], fill = Injury_History)) +
geom_density(alpha = 0.4) +
theme_minimal() +
labs(title = paste("Density of", v, "by Injury Status"))
)
}
Across all variables, density curves for injured and uninjured athletes overlap substantially, indicating weak univariate discrimination. Modest shifts are observed for training load, risky movements, and recovery time, suggesting these variables may contribute to injury risk in combination rather than independently.
Boxplots were used to complement density plots by highlighting group medians, variability, and potential outliers.
for (v in density_vars) {
print(
ggplot(athletes, aes(x = Injury_History, y = .data[[v]], fill = Injury_History)) +
geom_boxplot(alpha = 0.7) +
theme_minimal() +
labs(title = paste("Boxplot of", v, "by Injury Status"), x = "")
)
}
These visualizations confirm that injured athletes tend to exhibit slightly higher central values for workload and risky movement metrics, though substantial overlap persists across groups. No evidence of strong threshold effects or extreme outliers is observed.
Class balance was examined to inform subsequent model evaluation strategies.
ggplot(athletes, aes(x = Injury_History, fill = Injury_History)) +
geom_bar() +
theme_minimal() +
labs(title = "Class Distribution: Injury vs No Injury")
The distribution indicates a larger uninjured class relative to injured athletes, reflecting moderate class imbalance. This motivates the use of evaluation metrics that emphasize minority-class performance rather than overall accuracy alone.
Injury prevalence was examined across sports to assess whether sport-specific differences are evident.
injury_by_sport <- athletes %>%
dplyr::count(Sport, Injury_History) %>%
group_by(Sport) %>%
mutate(prop = n / sum(n))
ggplot(injury_by_sport, aes(x = Sport, y = prop, fill = Injury_History)) +
geom_col(position = "dodge") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
While injury proportions vary across sports, no sport exhibits extreme divergence. This suggests that sport alone is insufficient as a predictor and should be interpreted in conjunction with workload, recovery, and biometric variables.
Collectively, the exploratory analyses demonstrate that injury risk in collegiate athletes is not driven by any single variable or simple low-dimensional pattern. Instead, risk appears to arise from gradual shifts across multiple domains, with substantial overlap between injured and uninjured groups. These findings strongly motivate the use of multivariate, probabilistic, and potentially nonlinear modeling approaches in subsequent analyses.
Dimensionality reduction techniques were applied to assess whether the high-dimensional biometric dataset can be meaningfully summarized by a smaller number of latent structures. Many biometric and training variables reflect overlapping physiological processes, including aerobic capacity, workload accumulation, and recovery. These methods are used here strictly for exploratory diagnosis rather than prediction, with the goal of determining whether injury status aligns with dominant low-dimensional structure or remains distributed across multiple interacting dimensions.
Principal Component Analysis (PCA) was used to examine whether variance in the numeric biometric dataset concentrates along a small number of orthogonal axes. By transforming correlated predictors into uncorrelated components, PCA allows evaluation of whether injury risk aligns with dominant physiological dimensions rather than with individual variables.
PCA was performed using the prcomp() function with centering and scaling to account for heterogeneous measurement units. Variance explained by each component was examined using the summary() output.
pca_vars <- athletes %>% dplyr::select(all_of(numeric_vars))
# Preserve athlete identity for PCA and clustering
pca_vars <- athletes %>%
dplyr::select(all_of(numeric_vars))
rownames(pca_vars) <- athletes$Athlete_ID
## Warning: Setting row names on a tibble is deprecated.
ath_pca <- prcomp(pca_vars, center = TRUE, scale. = TRUE)
summary(ath_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.0736 1.0590 1.0331 1.0203 1.0145 0.99321 0.98433
## Proportion of Variance 0.1153 0.1121 0.1067 0.1041 0.1029 0.09865 0.09689
## Cumulative Proportion 0.1153 0.2274 0.3341 0.4382 0.5412 0.63980 0.73669
## PC8 PC9 PC10
## Standard deviation 0.95790 0.93557 0.91664
## Proportion of Variance 0.09176 0.08753 0.08402
## Cumulative Proportion 0.82845 0.91598 1.00000
Variance is distributed relatively evenly across components. The first principal component explains approximately 11.5% of total variance, with subsequent components each contributing comparable proportions. Approximately 54% of variance is captured by the first five components, and over 90% only after nine components. This pattern indicates that no single latent factor dominates variability and that the dataset retains substantial effective dimensionality.
The scree plot was used to visually assess whether a small subset of components captures a disproportionate share of variance.
fviz_eig(ath_pca, addlabels = TRUE)
## Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
## Ignoring empty aesthetic: `width`.
The plot shows a smooth decline in explained variance without a pronounced elbow, indicating that aggressive dimensionality reduction would result in substantial information loss. PCA therefore functions here as a diagnostic tool rather than a compression strategy for modeling.
The PCA biplot was used to visualize both variable loadings and athlete projections in reduced-dimensional space, allowing assessment of whether injured and uninjured athletes occupy distinct regions along dominant axes.
fviz_pca_biplot(
ath_pca,
geom.ind = "point",
habillage = athletes$Injury_History,
addEllipses = TRUE,
label = "var",
repel = TRUE
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
## Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Fitness-related variables (e.g., VO₂ max, endurance hours) load strongly along one principal axis, while workload and behavioral variables (e.g., training load, risky movements) load along another, with recovery measures projecting in opposing directions. Injured and uninjured athletes exhibit substantial overlap, although injured athletes tend to project modestly toward higher training stress and risk-associated directions. This pattern suggests gradual risk gradients rather than discrete separation in linear component space.
The biplot shows that aerobic fitness variables load strongly along one principal axis, whereas training stress and behavioral variables load along another. Recovery-related measures project in opposing directions, reflecting physiological trade-offs between workload accumulation and recovery capacity.
Injured and uninjured athletes exhibit substantial overlap in PCA space, and confidence ellipses largely intersect. However, injured athletes tend to project modestly toward directions associated with higher training stress and riskier movement profiles.
This visualization reinforces earlier EDA findings: injury risk does not correspond to a distinct linear combination of biometric variables but appears to increase gradually along axes associated with cumulative stress and insufficient recovery. These patterns support modeling injury risk as a probabilistic outcome rather than a discrete category.
Hierarchical clustering is used to explore whether athletes naturally group into biometric subprofiles based solely on physiological, behavioral, and workload-related variables, without incorporating injury status. Unlike supervised models, clustering can reveal latent structure or athlete “profiles” that may correspond to different risk states. Hierarchical clustering was used to explore whether athletes naturally group into biometric profiles based on physiological, behavioral, and workload variables alone, without reference to injury status. A three-cluster solution was examined to reflect biologically plausible risk strata rather than binary injury classification.
All numeric variables were scaled prior to distance computation. Euclidean distance and Ward’s linkage were used to favor compact clusters.
pca_scaled <- scale(pca_vars)
dist_mat <- dist(pca_scaled)
hc <- hclust(dist_mat, method = "ward.D2")
plot(hc, labels = FALSE)
rect.hclust(hc, k = 3, border = "red")
The dendrogram reveals gradual branching without sharply separated clusters. Cutting the tree at k = 3 yields 1 very small cluster, one intermediate one, and one very large cluster, indicating that a group dominates the data (uninjured as we have shown before) but that there might be an intermediate group of uninjured but with injured characteristics that may be at risk of injury in the future.
A heatmap was used to link clustering structure to underlying biometric patterns and to examine alignment with injury status.
# Build annotation using PCA row order
anno_athletes <- athletes %>%
dplyr::select(Athlete_ID, Injury_History) %>%
tibble::column_to_rownames("Athlete_ID") %>%
.[rownames(pca_scaled), , drop = FALSE]
# Sanity check (should be TRUE)
identical(rownames(anno_athletes), rownames(pca_scaled))
## [1] TRUE
# Heatmap
pheatmap(
mat = t(pca_scaled),
annotation_col = anno_athletes,
show_colnames = FALSE,
clustering_method = "ward.D2",
clustering_distance_cols = "euclidean",
fontsize = 6,
main = "Hierarchical Clustering Heatmap of Athlete Biometrics"
)
The heatmap displays smooth gradients across variables rather than abrupt transitions between clusters. Training load, risky movements, and recovery time show greater variability than fitness measures (e.g., VO2_Max and HR_Variability), while injury status does not align cleanly with any cluster.
To formally assess whether cluster membership is associated with injury history, a chi-square test was performed.
clusters <- cutree(hc, k = 3)
cluster_table <- table(clusters, athletes$Injury_History)
cluster_table
##
## clusters Uninjured Injured
## 1 575 224
## 2 120 50
## 3 48 33
chisq.test(cluster_table)
##
## Pearson's Chi-squared test
##
## data: cluster_table
## X-squared = 5.7414, df = 2, p-value = 0.05666
The test indicates a non-significant association, confirming that hierarchical clusters do not correspond to injury status.
Model-based clustering was applied to determine whether probabilistic Gaussian mixture models identify latent structure more effectively and to objectively select cluster number using the Bayesian Information Criterion (BIC).
mc <- Mclust(pca_scaled)
summary(mc)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust EEI (diagonal, equal volume and shape) model with 4 components:
##
## log-likelihood n df BIC ICL
## -14692.89 1050 53 -29754.48 -30062.31
##
## Clustering table:
## 1 2 3 4
## 254 254 272 270
plot(mc, what = "BIC")
plot(mc, what = "classification")
The BIC curve shows limited improvement with increasing model complexity. The selected solution identifies multiple components of similar size with substantial overlap, and cluster assignments do not align with injury status.
Across PCA, hierarchical clustering, and model-based clustering, biometric variability does not organize into discrete injury-defined groups or low-dimensional structures. Instead, variability is distributed across multiple partially independent axes, with injury risk increasing gradually along dimensions associated with cumulative workload, insufficient recovery, and risky movement behavior. These findings justify the use of supervised, probabilistic modeling approaches in subsequent sections and underscore the inherent complexity of injury prediction in athletic populations.
This section evaluates whether biometric, training, and demographic variables can predict injury history at the individual athlete level using supervised learning. Three complementary modeling paradigms were selected to reflect different assumptions about the injury-generation process: logistic regression (linear effects), Random Forest (nonlinear interactions), and Naive Bayes (probabilistic class-conditional structure).
All models were trained on a designated training set and evaluated on a held-out test set to assess generalization performance. Given the moderate class imbalance and the clinical relevance of identifying injured athletes, evaluation emphasizes discrimination and class-specific metrics rather than overall accuracy alone.
To enable unbiased model evaluation, the dataset was randomly partitioned into training (two-thirds) and test (one-third) subsets. Balance diagnostics were performed to confirm that the split preserved outcome prevalence and predictor distributions, thereby minimizing sampling artifacts that could confound model comparisons.
set.seed(123)
model_data <- athletes %>%
dplyr::select(Injury_History, all_of(numeric_vars), Gender, Sport, Feedback_Level)
# Create random 2/3–1/3 split
test_index <- sample(seq_len(nrow(model_data)), size = round(nrow(model_data) / 3))
train_data <- model_data[-test_index, ]
test_data <- model_data[test_index, ]
# Add indicator for balance checks
model_data$subset <- "train"
model_data$subset[test_index] <- "test"
model_data$subset <- factor(model_data$subset)
# Dimensions
dim(train_data)
## [1] 700 14
dim(test_data)
## [1] 350 14
# Outcome balance
prop.table(table(train_data$Injury_History))
##
## Uninjured Injured
## 0.6885714 0.3114286
prop.table(table(test_data$Injury_History))
##
## Uninjured Injured
## 0.7457143 0.2542857
# Numeric balance checks
numeric_balance <- model_data %>% dplyr::select(where(is.numeric), subset)
numeric_balance %>%
pivot_longer(cols = -subset, names_to = "variable", values_to = "value") %>%
ggplot(aes(x = subset, y = value)) +
facet_wrap(~variable, scales = "free") +
geom_boxplot() +
theme_minimal()
numeric_pvals <- numeric_balance %>%
dplyr::select(-subset) %>%
summarise(across(everything(), ~ t.test(.x ~ model_data$subset)$p.value))
numeric_pvals
## # A tibble: 1 × 10
## Age HR_Variability VO2_Max Speed_Index Endurance_Hours Risky_Moves
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.337 0.699 0.771 0.0363 0.976 0.0853
## # ℹ 4 more variables: Sleep_Hours <dbl>, Step_Count <dbl>, Training_Load <dbl>,
## # Recovery_Time <dbl>
# Categorical balance checks
categorical_balance <- model_data %>% dplyr::select(where(is.factor), subset)
categorical_balance %>%
pivot_longer(cols = -subset, names_to = "variable", values_to = "value") %>%
ggplot(aes(x = subset, fill = value)) +
facet_wrap(~variable) +
geom_bar(position = "fill") +
theme_minimal()
chisq_outcome <- model_data %>%
dplyr::count(Injury_History, subset) %>%
pivot_wider(names_from = subset, values_from = n, values_fill = 0) %>%
column_to_rownames("Injury_History") %>%
chisq.test()
chisq_outcome$p.value
## [1] 0.06474104
Visual inspection and formal tests indicate that injury prevalence, numeric predictor distributions, and categorical variable proportions are comparable between subsets. These checks confirm that downstream performance differences can be attributed to model behavior rather than structural differences introduced during data partitioning.
Logistic regression was used as an interpretable baseline model, estimating injury probability as a linear combination of biometric, workload, and contextual predictors. This specification reflects a standard multivariable approach in sports injury research and provides a benchmark for evaluating more flexible models.
log_model <- glm(
Injury_History ~ Age + HR_Variability + VO2_Max + Speed_Index +
Endurance_Hours + Risky_Moves + Sleep_Hours + Step_Count +
Training_Load + Recovery_Time + Gender + Sport + Feedback_Level,
data = train_data,
family = binomial
)
summary(log_model)
##
## Call:
## glm(formula = Injury_History ~ Age + HR_Variability + VO2_Max +
## Speed_Index + Endurance_Hours + Risky_Moves + Sleep_Hours +
## Step_Count + Training_Load + Recovery_Time + Gender + Sport +
## Feedback_Level, family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.485e+00 1.728e+00 -2.017 0.04372 *
## Age 8.560e-02 4.228e-02 2.024 0.04293 *
## HR_Variability 7.946e-03 8.173e-03 0.972 0.33097
## VO2_Max -1.091e-02 1.718e-02 -0.635 0.52531
## Speed_Index -2.918e-02 8.882e-02 -0.328 0.74254
## Endurance_Hours 1.084e-01 1.036e-01 1.047 0.29530
## Risky_Moves 6.906e-02 5.535e-02 1.248 0.21215
## Sleep_Hours -3.967e-02 8.429e-02 -0.471 0.63789
## Step_Count 7.549e-05 4.070e-05 1.855 0.06363 .
## Training_Load -2.314e-03 1.464e-03 -1.580 0.11401
## Recovery_Time 6.557e-03 1.632e-02 0.402 0.68778
## GenderM 2.328e-01 1.683e-01 1.383 0.16665
## SportBasketball 4.058e-01 2.926e-01 1.387 0.16549
## SportSoccer 4.972e-01 3.010e-01 1.652 0.09859 .
## SportSwimming 2.505e-01 2.933e-01 0.854 0.39303
## SportTennis 7.987e-01 2.864e-01 2.789 0.00529 **
## SportVolleyball 5.056e-01 3.086e-01 1.638 0.10134
## Feedback_Level2 -2.803e-01 2.623e-01 -1.069 0.28526
## Feedback_Level3 -1.068e-01 2.638e-01 -0.405 0.68549
## Feedback_Level4 -6.421e-02 2.578e-01 -0.249 0.80332
## Feedback_Level5 -6.699e-02 2.600e-01 -0.258 0.79666
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 868.33 on 699 degrees of freedom
## Residual deviance: 842.97 on 679 degrees of freedom
## AIC: 884.97
##
## Number of Fisher Scoring iterations: 4
log_prob <- predict(log_model, test_data, type = "response")
log_pred <- factor(ifelse(log_prob > 0.5, "Injured", "Uninjured"),
levels = c("Uninjured", "Injured"))
log_roc <- roc(test_data$Injury_History, log_prob)
## Setting levels: control = Uninjured, case = Injured
## Setting direction: controls > cases
auc(log_roc)
## Area under the curve: 0.5354
cm_log <- confusionMatrix(log_pred, test_data$Injury_History)
cm_log
## Confusion Matrix and Statistics
##
## Reference
## Prediction Uninjured Injured
## Uninjured 254 89
## Injured 7 0
##
## Accuracy : 0.7257
## 95% CI : (0.6758, 0.7718)
## No Information Rate : 0.7457
## P-Value [Acc > NIR] : 0.8218
##
## Kappa : -0.0385
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9732
## Specificity : 0.0000
## Pos Pred Value : 0.7405
## Neg Pred Value : 0.0000
## Prevalence : 0.7457
## Detection Rate : 0.7257
## Detection Prevalence : 0.9800
## Balanced Accuracy : 0.4866
##
## 'Positive' Class : Uninjured
##
Test-set evaluation reveals limited discrimination. Although overall accuracy is high, this reflects correct classification of the uninjured majority. ROC–AUC values are low, and sensitivity for injured athletes is poor, indicating that linear decision boundaries do not adequately capture injury risk in this dataset.
### Stepwise AIC Logistic Regression
log_model_aic <- stats::step(log_model, direction = "both", trace = FALSE)
summary(log_model_aic)
##
## Call:
## glm(formula = Injury_History ~ Age + Step_Count + Training_Load,
## family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.569e+00 1.052e+00 -2.443 0.0146 *
## Age 8.088e-02 4.095e-02 1.975 0.0483 *
## Step_Count 6.957e-05 3.974e-05 1.750 0.0800 .
## Training_Load -2.497e-03 1.434e-03 -1.741 0.0817 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 868.33 on 699 degrees of freedom
## Residual deviance: 858.83 on 696 degrees of freedom
## AIC: 866.83
##
## Number of Fisher Scoring iterations: 4
log_aic_prob <- predict(log_model_aic, test_data, type = "response")
log_aic_pred <- factor(ifelse(log_aic_prob > 0.5, "Injured", "Uninjured"),
levels = c("Uninjured", "Injured"))
log_aic_roc <- roc(test_data$Injury_History, log_aic_prob)
## Setting levels: control = Uninjured, case = Injured
## Setting direction: controls > cases
auc(log_aic_roc)
## Area under the curve: 0.5219
cm_log_aic <- confusionMatrix(log_aic_pred, test_data$Injury_History)
cm_log_aic
## Confusion Matrix and Statistics
##
## Reference
## Prediction Uninjured Injured
## Uninjured 261 89
## Injured 0 0
##
## Accuracy : 0.7457
## 95% CI : (0.6967, 0.7905)
## No Information Rate : 0.7457
## P-Value [Acc > NIR] : 0.5285
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.7457
## Neg Pred Value : NaN
## Prevalence : 0.7457
## Detection Rate : 0.7457
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : Uninjured
##
Stepwise variable selection using AIC was applied to assess whether a more parsimonious linear model improves generalization. While this procedure reduces model complexity, test-set performance remains largely unchanged. Discrimination and sensitivity do not improve meaningfully, suggesting that limited predictive performance is driven by weak linear signal rather than overfitting.
Together, the logistic regression results highlight the limitations of purely linear models for injury prediction and motivate exploration of nonlinear approaches.
Random Forest models were applied to capture nonlinear relationships and higher-order interactions among predictors. Baseline, tuned, and reduced-feature variants were evaluated to assess robustness and sensitivity to model specification.
## 6.3.1 Baseline Random Forest Model
set.seed(123)
rf_base <- randomForest(
Injury_History ~ .,
data = train_data,
ntree = 500,
importance = TRUE
)
# Variable importance
varImpPlot(
rf_base,
main = "Variable Importance – Baseline Random Forest"
)
# Predictions
rf_base_prob <- predict(rf_base, test_data, type = "prob")[, "Injured"]
rf_base_pred <- predict(rf_base, test_data)
# ROC and AUC
rf_base_roc <- roc(test_data$Injury_History, rf_base_prob)
## Setting levels: control = Uninjured, case = Injured
## Setting direction: controls < cases
auc(rf_base_roc)
## Area under the curve: 0.5031
# Confusion matrix
cm_rf_base <- confusionMatrix(rf_base_pred, test_data$Injury_History)
cm_rf_base
## Confusion Matrix and Statistics
##
## Reference
## Prediction Uninjured Injured
## Uninjured 255 84
## Injured 6 5
##
## Accuracy : 0.7429
## 95% CI : (0.6937, 0.7878)
## No Information Rate : 0.7457
## P-Value [Acc > NIR] : 0.5768
##
## Kappa : 0.0467
##
## Mcnemar's Test P-Value : 4.798e-16
##
## Sensitivity : 0.97701
## Specificity : 0.05618
## Pos Pred Value : 0.75221
## Neg Pred Value : 0.45455
## Prevalence : 0.74571
## Detection Rate : 0.72857
## Detection Prevalence : 0.96857
## Balanced Accuracy : 0.51660
##
## 'Positive' Class : Uninjured
##
Hyperparameter tuning explored different values of mtry and node size.
## 6.3.2 Random Forest Hyperparameter Optimization
set.seed(682)
rf_1 <- randomForest(Injury_History ~ ., data = train_data,
mtry = 2, nodesize = 5, importance = TRUE)
rf_2 <- randomForest(Injury_History ~ ., data = train_data,
mtry = 2, nodesize = 10, importance = TRUE)
rf_3 <- randomForest(Injury_History ~ ., data = train_data,
mtry = 4, nodesize = 5, importance = TRUE)
rf_4 <- randomForest(Injury_History ~ ., data = train_data,
mtry = 4, nodesize = 10, importance = TRUE)
# Compare training confusion matrices
rf_1$confusion
## Uninjured Injured class.error
## Uninjured 473 9 0.0186722
## Injured 216 2 0.9908257
rf_2$confusion
## Uninjured Injured class.error
## Uninjured 476 6 0.01244813
## Injured 216 2 0.99082569
rf_3$confusion
## Uninjured Injured class.error
## Uninjured 462 20 0.04149378
## Injured 213 5 0.97706422
rf_4$confusion
## Uninjured Injured class.error
## Uninjured 469 13 0.02697095
## Injured 212 6 0.97247706
# Select best-performing model (example: rf_4)
rf_best <- rf_4
# Variable importance
varImpPlot(
rf_best,
main = "Variable Importance – Optimized Random Forest"
)
# Predictions
rf_best_prob <- predict(rf_best, test_data, type = "prob")[, "Injured"]
rf_best_pred <- predict(rf_best, test_data)
# ROC and AUC
rf_best_roc <- roc(test_data$Injury_History, rf_best_prob)
## Setting levels: control = Uninjured, case = Injured
## Setting direction: controls > cases
auc(rf_best_roc)
## Area under the curve: 0.4925
# Confusion matrix
cm_rf_best <- confusionMatrix(rf_best_pred, test_data$Injury_History)
cm_rf_best
## Confusion Matrix and Statistics
##
## Reference
## Prediction Uninjured Injured
## Uninjured 253 85
## Injured 8 4
##
## Accuracy : 0.7343
## 95% CI : (0.6847, 0.7798)
## No Information Rate : 0.7457
## P-Value [Acc > NIR] : 0.712
##
## Kappa : 0.02
##
## Mcnemar's Test P-Value : 3.252e-15
##
## Sensitivity : 0.96935
## Specificity : 0.04494
## Pos Pred Value : 0.74852
## Neg Pred Value : 0.33333
## Prevalence : 0.74571
## Detection Rate : 0.72286
## Detection Prevalence : 0.96571
## Balanced Accuracy : 0.50715
##
## 'Positive' Class : Uninjured
##
A reduced-feature model excluding VO₂ max was also evaluated.
## 6.3.3 Random Forest Variable Reduction (Remove VO2_Max)
set.seed(123)
rf_no_vo2 <- randomForest(
Injury_History ~ . - VO2_Max,
data = train_data,
ntree = 500,
importance = TRUE
)
# Variable importance
varImpPlot(
rf_no_vo2,
main = "Variable Importance – Random Forest Without VO2_Max"
)
# Predictions
rf_no_vo2_prob <- predict(rf_no_vo2, test_data, type = "prob")[, "Injured"]
rf_no_vo2_pred <- predict(rf_no_vo2, test_data)
# ROC and AUC
rf_no_vo2_roc <- roc(test_data$Injury_History, rf_no_vo2_prob)
## Setting levels: control = Uninjured, case = Injured
## Setting direction: controls > cases
auc(rf_no_vo2_roc)
## Area under the curve: 0.5024
# Confusion matrix
cm_rf_no_vo2 <- confusionMatrix(rf_no_vo2_pred, test_data$Injury_History)
cm_rf_no_vo2
## Confusion Matrix and Statistics
##
## Reference
## Prediction Uninjured Injured
## Uninjured 253 84
## Injured 8 5
##
## Accuracy : 0.7371
## 95% CI : (0.6877, 0.7825)
## No Information Rate : 0.7457
## P-Value [Acc > NIR] : 0.6692
##
## Kappa : 0.0355
##
## Mcnemar's Test P-Value : 5.312e-15
##
## Sensitivity : 0.96935
## Specificity : 0.05618
## Pos Pred Value : 0.75074
## Neg Pred Value : 0.38462
## Prevalence : 0.74571
## Detection Rate : 0.72286
## Detection Prevalence : 0.96286
## Balanced Accuracy : 0.51276
##
## 'Positive' Class : Uninjured
##
Across all Random Forest implementations, variable importance rankings consistently emphasize training load, risky movements, and recovery time, aligning with earlier exploratory findings. However, test-set performance remains modest. While specificity is high, sensitivity for injured athletes remains low, and ROC–AUC values indicate weak discrimination.
Hyperparameter tuning and feature removal improve fit to the majority class but do not substantially enhance identification of injured athletes. These results suggest that model flexibility alone cannot overcome the substantial overlap between injured and uninjured biometric profiles.
Naive Bayes was included as a probabilistic classifier that explicitly models class-conditional predictor distributions. After centering and scaling numeric predictors, the model achieves the highest AUC among the tested approaches, although the improvement is modest and only marginally exceeds chance-level discrimination.
## 6.4 Naive Bayes (Scaled Numeric Predictors)
# Identify numeric predictors (exclude outcome)
nb_numeric_vars <- numeric_vars
# Fit scaler on training data only
scaler <- preProcess(
train_data[, nb_numeric_vars],
method = c("center", "scale")
)
# Apply scaling
train_nb_scaled <- train_data
test_nb_scaled <- test_data
train_nb_scaled[, nb_numeric_vars] <- predict(scaler, train_data[, nb_numeric_vars])
test_nb_scaled[, nb_numeric_vars] <- predict(scaler, test_data[, nb_numeric_vars])
# Fit Naive Bayes model
nb_model_scaled <- naive_bayes(
Injury_History ~ .,
data = train_nb_scaled
)
# Predictions
nb_prob_scaled <- predict(nb_model_scaled, test_nb_scaled, type = "prob")[, "Injured"]
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
nb_pred_scaled <- predict(nb_model_scaled, test_nb_scaled)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
# ROC and AUC
nb_roc_scaled <- roc(test_nb_scaled$Injury_History, nb_prob_scaled)
## Setting levels: control = Uninjured, case = Injured
## Setting direction: controls > cases
auc(nb_roc_scaled)
## Area under the curve: 0.5483
# Confusion matrix
cm_nb_scaled <- confusionMatrix(nb_pred_scaled, test_nb_scaled$Injury_History)
cm_nb_scaled
## Confusion Matrix and Statistics
##
## Reference
## Prediction Uninjured Injured
## Uninjured 250 85
## Injured 11 4
##
## Accuracy : 0.7257
## 95% CI : (0.6758, 0.7718)
## No Information Rate : 0.7457
## P-Value [Acc > NIR] : 0.8218
##
## Kappa : 0.0039
##
## Mcnemar's Test P-Value : 9.297e-14
##
## Sensitivity : 0.95785
## Specificity : 0.04494
## Pos Pred Value : 0.74627
## Neg Pred Value : 0.26667
## Prevalence : 0.74571
## Detection Rate : 0.71429
## Detection Prevalence : 0.95714
## Balanced Accuracy : 0.50140
##
## 'Positive' Class : Uninjured
##
Sensitivity remains limited, likely reflecting violations of the conditional independence assumption and extensive overlap between class-conditional distributions. As with other models, performance appears constrained by data structure rather than modeling choice.
Performance metrics across all models are summarized in a unified comparison table and ROC visualization. While overall accuracy is consistently high, this metric is driven by correct classification of uninjured athletes. Sensitivity for injured athletes remains low across all approaches, and ROC–AUC values cluster near 0.5–0.6.
## 6.5.1 Model Comparison Table
model_comparison <- tibble(
Model = c(
"Logistic",
"Logistic (AIC)",
"Random Forest (Baseline)",
"Random Forest (Optimized)",
"Random Forest (No VO2)",
"Naive Bayes (Scaled)"
),
Accuracy = c(
cm_log$overall["Accuracy"],
cm_log_aic$overall["Accuracy"],
cm_rf_base$overall["Accuracy"],
cm_rf_best$overall["Accuracy"],
cm_rf_no_vo2$overall["Accuracy"],
cm_nb_scaled$overall["Accuracy"]
),
Misclassification = 1 - Accuracy,
Deviance = c(
deviance(log_model),
deviance(log_model_aic),
NA, NA, NA, NA
),
Sensitivity = c(
cm_log$byClass["Sensitivity"],
cm_log_aic$byClass["Sensitivity"],
cm_rf_base$byClass["Sensitivity"],
cm_rf_best$byClass["Sensitivity"],
cm_rf_no_vo2$byClass["Sensitivity"],
cm_nb_scaled$byClass["Sensitivity"]
),
Specificity = c(
cm_log$byClass["Specificity"],
cm_log_aic$byClass["Specificity"],
cm_rf_base$byClass["Specificity"],
cm_rf_best$byClass["Specificity"],
cm_rf_no_vo2$byClass["Specificity"],
cm_nb_scaled$byClass["Specificity"]
),
AUC = c(
auc(log_roc),
auc(log_aic_roc),
auc(rf_base_roc),
auc(rf_best_roc),
auc(rf_no_vo2_roc),
auc(nb_roc_scaled)
)
)
model_comparison
## # A tibble: 6 × 7
## Model Accuracy Misclassification Deviance Sensitivity Specificity AUC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Logistic 0.726 0.274 843. 0.973 0 0.535
## 2 Logistic (A… 0.746 0.254 859. 1 0 0.522
## 3 Random Fore… 0.743 0.257 NA 0.977 0.0562 0.503
## 4 Random Fore… 0.734 0.266 NA 0.969 0.0449 0.493
## 5 Random Fore… 0.737 0.263 NA 0.969 0.0562 0.502
## 6 Naive Bayes… 0.726 0.274 NA 0.958 0.0449 0.548
## ROC Curve Comparison Across Models
# Extract AUC values
auc_vals <- c(
Logistic = as.numeric(auc(log_roc)),
Logistic_AIC = as.numeric(auc(log_aic_roc)),
RF_Baseline = as.numeric(auc(rf_base_roc)),
RF_Optimized = as.numeric(auc(rf_best_roc)),
RF_No_VO2 = as.numeric(auc(rf_no_vo2_roc)),
Naive_Bayes = as.numeric(auc(nb_roc_scaled))
)
# Base ROC
plot(
log_roc,
col = "black",
lwd = 2,
main = "ROC Curve Comparison Across Models"
)
# Add remaining models
lines(log_aic_roc, col = "blue", lwd = 2)
lines(rf_base_roc, col = "darkgreen", lwd = 2)
lines(rf_best_roc, col = "red", lwd = 2)
lines(rf_no_vo2_roc, col = "purple", lwd = 2)
lines(nb_roc_scaled, col = "orange", lwd = 2)
# Reference line
abline(0, 1, lty = 2, col = "gray")
# Legend
legend(
"bottomright",
legend = c(
paste0("Logistic (AUC = ", round(auc_vals["Logistic"], 3), ")"),
paste0("Logistic AIC (AUC = ", round(auc_vals["Logistic_AIC"], 3), ")"),
paste0("RF Baseline (AUC = ", round(auc_vals["RF_Baseline"], 3), ")"),
paste0("RF Optimized (AUC = ", round(auc_vals["RF_Optimized"], 3), ")"),
paste0("RF No VO2 (AUC = ", round(auc_vals["RF_No_VO2"], 3), ")"),
paste0("Naive Bayes (AUC = ", round(auc_vals["Naive_Bayes"], 3), ")")
),
col = c("black", "blue", "darkgreen", "red", "purple", "orange"),
lwd = 2,
bty = "n",
cex = 0.85
)
# Save trained Naive Bayes model
save(nb_model_scaled, file = "nb_model_scaled.RData")
# Save scaler
save(scaler, file = "scaler.RData")
# Save numeric variable list
save(nb_numeric_vars, file = "nb_numeric_vars.RData")
Although Random Forest and Naive Bayes marginally outperform logistic regression, no model demonstrates strong or clinically actionable predictive capability. Collectively, these findings reinforce the conclusion that injury prediction using cross-sectional biometric data is intrinsically difficult and that improved performance will likely require richer longitudinal, biomechanical, or contextual information rather than alternative classifiers alone.
The objective of this project was to evaluate whether routinely collected biometric, behavioral, and training-related variables could be used to characterize and predict sports-related injury risk in a collegiate athlete population. Using a comprehensive analytical pipeline that integrated exploratory data analysis, dimensionality reduction, unsupervised clustering, and multiple supervised learning approaches, this study provides a balanced assessment of both the potential and the limitations of data-driven injury modeling.
Across all analytical stages, a consistent finding emerged: injury risk does not appear as a simple or separable pattern within the available data. Exploratory analyses revealed substantial overlap between injured and uninjured athletes across individual variables and multivariate relationships. Although injured athletes tended to exhibit higher training loads, greater engagement in risky movements, and reduced recovery time, these differences were gradual rather than categorical. No single variable, nor any low-dimensional combination of variables, offered strong discriminatory power. This pattern aligns with contemporary injury science, which views injury as an emergent outcome arising from cumulative stress, limited adaptation, and interacting physiological processes rather than isolated risk factors.
Dimensionality reduction and unsupervised learning analyses reinforced this interpretation. Principal Component Analysis revealed a high-dimensional variance structure, with no dominant latent axis summarizing injury-related variability. The absence of a clear elbow in the scree plot and the extensive overlap observed in PCA biplots indicate that biometric variation reflects multiple partially independent domains, including workload, fitness, recovery, and behavior. Injury status did not align cleanly with these axes, suggesting that risk is not encoded along a single physiological dimension.
Similarly, hierarchical and model-based clustering failed to identify discrete athlete subgroups corresponding to injury history. Cluster structures exhibited smooth gradients rather than sharp boundaries, and formal statistical tests showed weak or non-significant associations between cluster membership and injury status. These findings argue against stable “injury-prone” biometric phenotypes and instead support a continuum-based view of injury risk, in which athletes move dynamically between lower- and higher-risk states as training demands and recovery balance fluctuate.
Supervised predictive modeling highlighted the practical implications of this underlying data structure. Logistic regression, Random Forest, and Naive Bayes classifiers all demonstrated limited discrimination, with ROC–AUC values near chance and consistently low sensitivity for injured athletes. While overall accuracy appeared high, it was driven largely by correct classification of the uninjured majority and therefore overstated predictive performance. Across models, injured athletes remained difficult to identify using cross-sectional biometric data alone.
The modest gains achieved by more flexible models such as Random Forest and Naive Bayes indicate that performance limitations are not primarily methodological. Even when nonlinear interactions and probabilistic structure were modeled, substantial overlap between injured and uninjured predictor distributions constrained achievable performance. Variable importance analyses consistently highlighted training load, risky movements, and recovery-related measures, reinforcing their relevance for athlete monitoring. However, these variables contribute probabilistically rather than deterministically, increasing risk without defining clear decision thresholds.
These results underscore a central principle in applied machine learning: predictive performance is fundamentally limited by data structure and content. While the variables included capture meaningful aspects of physiological load and behavior, they represent only a subset of relevant injury determinants. Unmeasured factors such as biomechanical asymmetries, tissue-specific loading, psychological stress, prior injury severity, and contextual training decisions likely play important roles. Moreover, injury itself is not binary. Differences in injury type, severity, duration, and functional impact are not captured by the outcome variable used here.
The cross-sectional nature of the dataset further limits inference. Injury history is treated as a static label without information on timing relative to data collection, making it unclear whether observed biometrics reflect pre-injury risk, post-injury adaptation, or compensatory behavior. Without longitudinal tracking, acute-to-chronic workload transitions, delayed effects, and temporal risk evolution cannot be modeled, likely attenuating predictive signal.
Despite these limitations, this project makes several meaningful contributions. It demonstrates a rigorous and transparent analytical workflow, provides empirical support for conceptualizing injury risk as a continuous and multifactorial process, and consistently identifies workload, recovery, and behavioral measures as relevant monitoring signals, even when deterministic prediction is not achievable.
The interactive application developed as part of this work further enhances its translational value by enabling dynamic exploration of variable relationships and probabilistic risk estimates for hypothetical athlete profiles. While not intended as a diagnostic tool, it offers a structured framework for hypothesis generation and decision support. In applied settings, such tools are most effective when used to augment expert judgment rather than replace it.
In conclusion, this study demonstrates that predicting sports-related injuries using cross-sectional biometric data remains inherently challenging. The absence of strong predictive performance reflects the biological and contextual complexity of injury mechanisms rather than analytical shortcomings. These findings highlight the need for richer, longitudinal, and multidimensional data sources and for probabilistic risk frameworks that explicitly acknowledge uncertainty. By critically evaluating both the capabilities and limitations of biometric injury modeling, this work contributes to a more realistic and scientifically grounded understanding of how data-driven approaches can support injury prevention in collegiate athletics.
tibble(
Variable = c(numeric_vars, "Gender", "Sport", "Feedback_Level", "Injury_History"),
Description = c(
"Age in years",
"Heart rate variability",
"VO2 max",
"Speed performance index",
"Weekly endurance hours",
"Count of risky movements",
"Average sleep hours",
"Daily step count",
"Composite training load",
"Recovery time post training",
"Gender",
"Sport",
"Subjective feedback level",
"Injury status"
)
)
## # A tibble: 14 × 2
## Variable Description
## <chr> <chr>
## 1 Age Age in years
## 2 HR_Variability Heart rate variability
## 3 VO2_Max VO2 max
## 4 Speed_Index Speed performance index
## 5 Endurance_Hours Weekly endurance hours
## 6 Risky_Moves Count of risky movements
## 7 Sleep_Hours Average sleep hours
## 8 Step_Count Daily step count
## 9 Training_Load Composite training load
## 10 Recovery_Time Recovery time post training
## 11 Gender Gender
## 12 Sport Sport
## 13 Feedback_Level Subjective feedback level
## 14 Injury_History Injury status
The PSL4040 Athlete Injury Risk Explorer is an interactive R Shiny application designed to serve as a translational tool for exploring collegiate athlete data and calculating the probabilistic risk of sports-related injury.
It bridges formal statistical analysis with applied interpretation, focusing on exploration and explanation rather than serving as a definitive predictive decision tool. This design reflects the project’s finding that injury risk is a complex, multifactorial, and continuous process that resists simple, sharp classification.
Application Components and Analytical Value The application is organized into six tabs, offering a blend of data visualization, utility, and predictive modeling:
Underlying Model: It uses a pre-trained Naive Bayes Machine Learning Model (nb_model_scaled) developed in a prior analysis.
Data Inputs (Key Variables):
Biometrics/Physiological: Age, HR Variability, VO2 Max, Speed Index.
Workload/Behavioral: Endurance Hours, Risky Moves, Sleep Hours, Step Count, Training Load, Recovery Time.
Categorical: Gender, Sport, Feedback Level.
Disclaimer: The app consistently reinforces that all output represents probabilistic risk, not deterministic outcomes or a diagnosis.
Manual Athlete Entry: Users input specific values for a single athlete’s 13 variables. The app processes this input and displays the single resulting Probability of Injury in a minimal table format.
CSV File Upload: Users can upload a dataset of multiple new athletes. The app calculates the injury probability for every athlete and presents the full results in a single, interactive, searchable, and pageable table.
Scatter Plot Explorer: Dynamically plots any two numeric variables, coloring points by a categorical group (e.g., Injury History) to assess bivariate relationships, overlap, and separation (or lack thereof).
Box Plot Explorer: Compares the distribution of a single numeric variable across different categorical groups.
Categorical Distributions: Visualizes the frequency counts and relationships between categorical variables.
Analytical Value: By allowing users to explore different variable combinations, the app makes it immediately clear that injury status does not correspond to clean thresholds in biometric variables, and overlap between injured and uninjured athletes is pervasive.
Data Table: An interactive view of the raw training dataset, allowing users to sort, search, and inspect individual athlete profiles to cross-reference plotted patterns with raw values.
Glossary: Provides a detailed definition for every variable used in the application (e.g., HR Variability, Training Load).