Primary Aim
To identify which biometric, workload, sleep, and performance metrics are most strongly associated with prior injury among college athletes, and evaluate whether multivariable methods can detect high-risk patterns.
Data source: Kaggle, College Athlete Training and Performance Dataset: https://www.kaggle.com/datasets/pythondeveloper007/college-athlete-training-and-performance-data
This dataset includes biometric, workload, sleep, and injury-related metrics collected from college athletes across multiple sports. While not tied to a peer-reviewed publication, it mirrors datasets commonly used in sports-medicine research with wearables and athlete monitoring systems. Such datasets allow investigation of injury risk, training load, recovery, and performance optimization.
The dataset contains 1,050 anonymized observations and 15 variables, grouped as follows:
Demographics: Age, Gender, Sport
Biometrics: HR_Variability, VO2_Max, Speed_Index
Training/Workload: Endurance_Hours, Risky_Moves, Step_Count, Training_Load
Recovery: Sleep_Hours, Recovery_Time
History & Feedback: Injury_History (0/1), Feedback_Level
Identifier: Athlete_ID (non-analytic)
This combination of physiological, performance, and workload variables makes the dataset suitable for modeling injury risk and understanding patterns of fatigue or performance deficits.
The dataset appears moderately preprocessed:
No missing values in initial inspection
Consistent variable naming
Numeric and categorical variables stored appropriately
It is not raw physiological data, but a cleaned summary dataset ready for exploratory analysis and modeling.
A secondary aim is to develop an interactive Shiny dashboard that visualizes:
Injury prevalence across sports
Relationships between workload and recovery
Biometric-performance correlations
Athlete-level summary patterns
This extends the analysis beyond the summary provided on Kaggle and aligns with the course focus on wrangling, exploration, and presentation.
Because this dataset is not associated with a publication, the analysis will focus on producing new exploratory insights through the components described below.
I will examine associations between Injury_History and the following variables:
Age, Gender, Sport, Feedback_Level, HR_Variability, VO2_Max, Speed_Index, Endurance_Hours, Risky_Moves, Step_Count, Training_Load, Sleep_Hours, Recovery_Time
I will also explore non-linear or interaction patterns (e.g., high training load but low sleep) to identify risk combinations not captured by univariate analysis.
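The load-by-sleep idea above could be probed along the following lines. This is only a sketch on simulated data: the values are random stand-ins that reuse the dataset's column names, and the quartile cut-offs for "high load" and "low sleep" are assumptions chosen for illustration, not thresholds taken from the dataset.

```r
# Simulated stand-in for the athlete data (random values, NOT the real dataset).
set.seed(1)
n <- 500
sim <- data.frame(
  Training_Load  = runif(n, 200, 400),
  Sleep_Hours    = rnorm(n, mean = 7, sd = 1),
  Injury_History = rbinom(n, 1, 0.3)
)

# Assumed cut-offs: top quartile of load, bottom quartile of sleep.
sim$high_load <- sim$Training_Load >= quantile(sim$Training_Load, 0.75)
sim$low_sleep <- sim$Sleep_Hours   <= quantile(sim$Sleep_Hours, 0.25)

# Proportion with prior injury in each of the four load/sleep cells.
cell_rates <- aggregate(Injury_History ~ high_load + low_sleep,
                        data = sim, FUN = mean)
print(cell_rates)

# Logistic model with main effects plus a Training_Load x Sleep_Hours interaction.
fit_int <- glm(Injury_History ~ Training_Load * Sleep_Hours,
               data = sim, family = binomial)
print(round(coef(summary(fit_int)), 4))
```

The cell table gives a quick descriptive check, while the `Training_Load:Sleep_Hours` coefficient tests whether the association with load changes with sleep on the log-odds scale.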
PCA of biometric and workload variables to identify latent dimensions (and any clusters along them) representing fatigue, recovery, or performance profiles.
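A minimal sketch of the PCA step, assuming base R's `prcomp()` is used. The data frame below is simulated with the dataset's column names purely so the example runs on its own; it says nothing about the real results.

```r
# Simulated stand-in for a subset of the numeric variables (NOT the real data).
set.seed(2)
n <- 300
sim <- data.frame(
  HR_Variability = rnorm(n, 60, 10),
  VO2_Max        = rnorm(n, 45, 5),
  Training_Load  = runif(n, 200, 400),
  Sleep_Hours    = rnorm(n, 7, 1),
  Recovery_Time  = rnorm(n, 24, 5)
)

# Centre and scale first, because the variables are on very different scales.
pca <- prcomp(sim, center = TRUE, scale. = TRUE)

# Share of total variance explained by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings of the first two components (how each variable contributes).
print(round(pca$rotation[, 1:2], 2))

# Component scores, which a biplot or clustering step would work from.
scores <- as.data.frame(pca$x[, 1:2])
print(head(scores))
```

If the real variables turn out to be nearly uncorrelated, each component will explain a similar share of variance and PCA will compress the data very little, which is worth checking before interpreting components as fatigue or recovery profiles.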
Visualizations designed to support the upcoming Shiny app:
Sport-specific injury rates
Workload–recovery imbalance plots
Performance correlations (e.g., VO2 Max vs Speed Index)
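As an illustration of the first plot in this list, sport-specific injury rates could be computed and drawn roughly as follows. The data are simulated (random sports and injury flags), and the plot styling is an assumption rather than the final dashboard design.

```r
library(ggplot2)

# Simulated stand-in data (NOT the real dataset).
set.seed(6)
sports <- c("Athletics", "Basketball", "Soccer",
            "Swimming", "Tennis", "Volleyball")
sim <- data.frame(
  Sport          = sample(sports, 600, replace = TRUE),
  Injury_History = rbinom(600, 1, 0.3)
)

# Proportion of athletes with a prior injury within each sport.
rates <- aggregate(Injury_History ~ Sport, data = sim, FUN = mean)
names(rates)[2] <- "injury_rate"
print(rates)

# Horizontal bar chart, sports ordered by injury rate.
p <- ggplot(rates, aes(x = reorder(Sport, injury_rate), y = injury_rate)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  labs(x = NULL, y = "Athletes with prior injury")
print(p)
```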
Logistic regression for injury association
Random forest for variable importance
ROC/AUC evaluation
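The logistic regression and AUC steps could look roughly like this. It is a sketch under stated assumptions: the data are simulated, the outcome is generated with a built-in dependence on sleep so that the model has something to find, and the AUC is computed with a small hypothetical helper (`auc_rank`) based on the rank-sum formulation rather than a dedicated ROC package.

```r
# Simulated data with an ASSUMED effect of sleep on injury (NOT the real data).
set.seed(3)
n <- 1000
sim <- data.frame(
  Training_Load = runif(n, 200, 400),
  Sleep_Hours   = rnorm(n, 7, 1)
)
p_true <- plogis(-1 + 0.005 * (sim$Training_Load - 300) -
                   0.8 * (sim$Sleep_Hours - 7))
sim$Injury_History <- rbinom(n, 1, p_true)

# Hold out 30% of rows for evaluation.
test_idx <- sample(n, size = round(0.3 * n))
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

# Fit on the training rows, predict probabilities on the held-out rows.
fit  <- glm(Injury_History ~ Training_Load + Sleep_Hours,
            data = train, family = binomial)
pred <- predict(fit, newdata = test, type = "response")

# Hypothetical helper: AUC as the probability that a random injured case
# receives a higher score than a random uninjured case (rank-sum form).
auc_rank <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc <- auc_rank(pred, test$Injury_History)
print(round(auc, 3))
```

Evaluating on held-out rows rather than the training data keeps the AUC from being optimistically biased.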
This approach provides novel analysis not contained in the original data description and fits the assignment’s applied focus.
Load CSV into R using readr
Convert categorical variables to factors: Gender, Sport, Injury_History, Feedback_Level
Confirm numeric types for continuous variables: HR_Variability, VO2_Max, Training_Load, etc.
Inspect distributions and outliers
Prepare scaled/normalized values for PCA or modeling
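The outlier check and scaling steps in this list could be sketched as below. The 1.5 × IQR rule and z-score scaling are assumptions about how these steps might be done, and the data are simulated so the example is self-contained.

```r
# Simulated stand-in for two numeric columns (NOT the real data).
set.seed(8)
sim <- data.frame(
  Sleep_Hours   = rnorm(200, 7, 1),
  Training_Load = runif(200, 200, 400)
)

# Hypothetical helper: flag values beyond 1.5 * IQR from the quartiles.
is_outlier <- function(x) {
  q     <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Number of flagged values per column.
outlier_counts <- sapply(sim, function(x) sum(is_outlier(x)))
print(outlier_counts)

# z-score scaling: each column gets mean 0 and standard deviation 1.
scaled <- as.data.frame(scale(sim))
print(round(colMeans(scaled), 3))
print(round(sapply(scaled, sd), 3))
```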
Density and boxplots for workload and biometric variables (ggplot2, ggpubr)
Correlation matrices via GGally and corrplot
Injury vs non-injury comparisons
Sport-specific demographic and workload summaries
Logistic regression using glm() for injury prediction
Random forest with randomForest for variable importance
PCA for dimensionality reduction and cluster detection
Optional t-SNE or UMAP for nonlinear latent structures
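For the random forest step, a minimal sketch with the `randomForest` package might look like this. The data are simulated, with an assumed dependence of injury on sleep only, and `ntree = 300` is an arbitrary illustrative choice.

```r
library(randomForest)

# Simulated data with an ASSUMED sleep effect (NOT the real dataset).
set.seed(4)
n <- 600
sim <- data.frame(
  Training_Load = runif(n, 200, 400),
  Sleep_Hours   = rnorm(n, 7, 1),
  VO2_Max       = rnorm(n, 45, 5)
)
p_true <- plogis(-1 - 1.0 * (sim$Sleep_Hours - 7))
sim$Injury_History <- factor(rbinom(n, 1, p_true),
                             levels = c(0, 1),
                             labels = c("Uninjured", "Injured"))

# Classification forest; importance = TRUE also stores permutation importance.
rf <- randomForest(Injury_History ~ ., data = sim,
                   ntree = 300, importance = TRUE)

# Importance matrix, sorted by mean decrease in Gini impurity.
imp <- importance(rf)
print(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ])
```

Permutation importance (`MeanDecreaseAccuracy`) is usually the safer column to read when predictors differ in scale or number of distinct values, since Gini importance tends to favour variables with many possible split points.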
UI Components:
Select sport, gender, variable type, etc.
Dynamic plots (ggplot-driven)
Summary statistics panel
Visual Modules:
Injury bar chart by sport
Workload vs recovery scatterplots
HR_Variability/VO2_Max/Speed_Index comparisons
PCA biplot viewer
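The UI components and one visual module listed above could be wired together roughly as follows. This is a sketch only: the input and output IDs (`sport`, `scatter`, `stats`) are hypothetical names, the layout is an assumption, and the data are simulated.

```r
library(shiny)
library(ggplot2)

# Simulated stand-in data (NOT the real dataset).
set.seed(5)
sim <- data.frame(
  Sport         = sample(c("Basketball", "Tennis", "Swimming"),
                         300, replace = TRUE),
  Training_Load = runif(300, 200, 400),
  Recovery_Time = rnorm(300, 24, 5)
)

ui <- fluidPage(
  titlePanel("Workload vs recovery (sketch)"),
  sidebarLayout(
    sidebarPanel(
      # Sport filter; "All" keeps every row.
      selectInput("sport", "Sport",
                  choices = c("All", sort(unique(sim$Sport))))
    ),
    mainPanel(
      plotOutput("scatter"),
      verbatimTextOutput("stats")
    )
  )
)

server <- function(input, output, session) {
  # Reactive subset shared by the plot and the summary panel.
  filtered <- reactive({
    if (input$sport == "All") sim else sim[sim$Sport == input$sport, ]
  })

  output$scatter <- renderPlot({
    ggplot(filtered(), aes(x = Training_Load, y = Recovery_Time)) +
      geom_point(alpha = 0.6) +
      labs(x = "Training load", y = "Recovery time (hours)")
  })

  output$stats <- renderPrint({
    summary(filtered()[, c("Training_Load", "Recovery_Time")])
  })
}

# Build the app object; launch it interactively with runApp(app).
app <- shinyApp(ui, server)
```

Keeping the filtering in a single `reactive()` means every module added later reads from the same subset.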
Data handling: readr, dplyr, formattable, broom, car
Visualization: ggplot2, ggpubr, GGally, cowplot, corrplot, scales, DT
Machine learning & statistics: caret, randomForest, mclust, Rtsne, umap
Shiny app: shiny, rsconnect
These packages will allow complete exploration, modeling, visualization, and interactive dashboard creation.
Below is the required initial inspection (without printing the full dataset).
# Data handling
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(formattable)
library(broom)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
# Visualization
library(ggplot2)
library(ggpubr)
library(GGally)
library(cowplot)
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggpubr':
##
## get_legend
library(corrplot)
## corrplot 0.95 loaded
library(scales)
##
## Attaching package: 'scales'
## The following objects are masked from 'package:formattable':
##
## comma, percent, scientific
## The following object is masked from 'package:readr':
##
## col_factor
library(DT)
# Machine learning & statistics
library(caret)
## Loading required package: lattice
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(mclust)
## Package 'mclust' version 6.1.2
## Type 'citation("mclust")' for citing this R package in publications.
##
## Attaching package: 'mclust'
## The following object is masked from 'package:dplyr':
##
## count
library(Rtsne)
library(umap)
# Shiny app
library(shiny)
##
## Attaching package: 'shiny'
## The following object is masked from 'package:mclust':
##
## em
## The following objects are masked from 'package:DT':
##
## dataTableOutput, renderDataTable
library(rsconnect)
##
## Attaching package: 'rsconnect'
## The following object is masked from 'package:shiny':
##
## serverInfo
# Read dataset
athletes <- read_csv("College_Sports_Dataset.csv")
## Rows: 1050 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Athlete_ID, Gender, Sport
## dbl (12): Age, HR_Variability, VO2_Max, Speed_Index, Endurance_Hours, Risky_...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Missing values, zero counts, structure & dimensions
summary(is.na(athletes))
## Athlete_ID Age Gender Sport
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:1050 FALSE:1050 FALSE:1050 FALSE:1050
## HR_Variability VO2_Max Speed_Index Endurance_Hours
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:1050 FALSE:1050 FALSE:1050 FALSE:1050
## Risky_Moves Sleep_Hours Step_Count Injury_History
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:1050 FALSE:1050 FALSE:1050 FALSE:1050
## Training_Load Recovery_Time Feedback_Level
## Mode :logical Mode :logical Mode :logical
## FALSE:1050 FALSE:1050 FALSE:1050
# Count zeros per column (column-wise, so numeric columns are not coerced to text first)
sapply(athletes, function(x) sum(x == 0))
## Athlete_ID Age Gender Sport HR_Variability
## 0 0 0 0 0
## VO2_Max Speed_Index Endurance_Hours Risky_Moves Sleep_Hours
## 0 0 0 156 0
## Step_Count Injury_History Training_Load Recovery_Time Feedback_Level
## 0 743 0 0 0
str(athletes)
## spc_tbl_ [1,050 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Athlete_ID : chr [1:1050] "A0001" "A0002" "A0003" "A0004" ...
## $ Age : num [1:1050] 24 21 22 24 20 22 22 24 19 20 ...
## $ Gender : chr [1:1050] "M" "F" "M" "M" ...
## $ Sport : chr [1:1050] "Tennis" "Volleyball" "Swimming" "Basketball" ...
## $ HR_Variability : num [1:1050] 51.4 51.2 69.6 59.5 52.2 ...
## $ VO2_Max : num [1:1050] 38.1 52.2 40.5 41.2 48.2 ...
## $ Speed_Index : num [1:1050] 6.71 5.47 5.61 7.88 6.14 5.95 5.16 7.35 5.44 6.48 ...
## $ Endurance_Hours: num [1:1050] 4.18 3.19 4.77 4.29 3.09 2.72 3.22 4.67 3.76 4.06 ...
## $ Risky_Moves : num [1:1050] 1 3 1 4 2 3 2 2 1 1 ...
## $ Sleep_Hours : num [1:1050] 7.8 6.1 6.4 7.9 6.3 6.8 6 8.8 7.5 6.7 ...
## $ Step_Count : num [1:1050] 14229 9239 14566 11391 12213 ...
## $ Injury_History : num [1:1050] 0 1 0 0 1 1 0 0 0 0 ...
## $ Training_Load : num [1:1050] 392 361 250 202 327 318 257 324 399 312 ...
## $ Recovery_Time : num [1:1050] 30.1 21 24.3 25.9 16.6 22.6 30.6 29.4 22.5 27.6 ...
## $ Feedback_Level : num [1:1050] 5 4 1 3 2 4 4 5 4 5 ...
## - attr(*, "spec")=
## .. cols(
## .. Athlete_ID = col_character(),
## .. Age = col_double(),
## .. Gender = col_character(),
## .. Sport = col_character(),
## .. HR_Variability = col_double(),
## .. VO2_Max = col_double(),
## .. Speed_Index = col_double(),
## .. Endurance_Hours = col_double(),
## .. Risky_Moves = col_double(),
## .. Sleep_Hours = col_double(),
## .. Step_Count = col_double(),
## .. Injury_History = col_double(),
## .. Training_Load = col_double(),
## .. Recovery_Time = col_double(),
## .. Feedback_Level = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
dim(athletes)
## [1] 1050 15
# Factor variables
athletes$Athlete_ID <- as.factor(athletes$Athlete_ID)
athletes$Gender <- as.factor(athletes$Gender)
athletes$Sport <- as.factor(athletes$Sport)
athletes$Injury_History <- as.factor(athletes$Injury_History)
athletes$Feedback_Level <- as.factor(athletes$Feedback_Level)
# Numeric variables
athletes$Age <- as.numeric(athletes$Age)
athletes$HR_Variability <- as.numeric(athletes$HR_Variability)
athletes$VO2_Max <- as.numeric(athletes$VO2_Max)
athletes$Speed_Index <- as.numeric(athletes$Speed_Index)
athletes$Endurance_Hours <- as.numeric(athletes$Endurance_Hours)
athletes$Risky_Moves <- as.numeric(athletes$Risky_Moves)
athletes$Sleep_Hours <- as.numeric(athletes$Sleep_Hours)
athletes$Step_Count <- as.numeric(athletes$Step_Count)
athletes$Training_Load <- as.numeric(athletes$Training_Load)
athletes$Recovery_Time <- as.numeric(athletes$Recovery_Time)
# Convert Injury_History labels to text
levels(athletes$Injury_History) <- c("Uninjured", "Injured")
# Check changes
str(athletes)
## spc_tbl_ [1,050 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Athlete_ID : Factor w/ 1050 levels "A0001","A0002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : num [1:1050] 24 21 22 24 20 22 22 24 19 20 ...
## $ Gender : Factor w/ 2 levels "F","M": 2 1 2 2 2 1 1 1 1 2 ...
## $ Sport : Factor w/ 6 levels "Athletics","Basketball",..: 5 6 4 2 2 5 4 1 4 1 ...
## $ HR_Variability : num [1:1050] 51.4 51.2 69.6 59.5 52.2 ...
## $ VO2_Max : num [1:1050] 38.1 52.2 40.5 41.2 48.2 ...
## $ Speed_Index : num [1:1050] 6.71 5.47 5.61 7.88 6.14 5.95 5.16 7.35 5.44 6.48 ...
## $ Endurance_Hours: num [1:1050] 4.18 3.19 4.77 4.29 3.09 2.72 3.22 4.67 3.76 4.06 ...
## $ Risky_Moves : num [1:1050] 1 3 1 4 2 3 2 2 1 1 ...
## $ Sleep_Hours : num [1:1050] 7.8 6.1 6.4 7.9 6.3 6.8 6 8.8 7.5 6.7 ...
## $ Step_Count : num [1:1050] 14229 9239 14566 11391 12213 ...
## $ Injury_History : Factor w/ 2 levels "Uninjured","Injured": 1 2 1 1 2 2 1 1 1 1 ...
## $ Training_Load : num [1:1050] 392 361 250 202 327 318 257 324 399 312 ...
## $ Recovery_Time : num [1:1050] 30.1 21 24.3 25.9 16.6 22.6 30.6 29.4 22.5 27.6 ...
## $ Feedback_Level : Factor w/ 5 levels "1","2","3","4",..: 5 4 1 3 2 4 4 5 4 5 ...
## - attr(*, "spec")=
## .. cols(
## .. Athlete_ID = col_character(),
## .. Age = col_double(),
## .. Gender = col_character(),
## .. Sport = col_character(),
## .. HR_Variability = col_double(),
## .. VO2_Max = col_double(),
## .. Speed_Index = col_double(),
## .. Endurance_Hours = col_double(),
## .. Risky_Moves = col_double(),
## .. Sleep_Hours = col_double(),
## .. Step_Count = col_double(),
## .. Injury_History = col_double(),
## .. Training_Load = col_double(),
## .. Recovery_Time = col_double(),
## .. Feedback_Level = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Quick summaries
summary(athletes)
## Athlete_ID Age Gender Sport HR_Variability
## A0001 : 1 Min. :18.00 F:499 Athletics :191 Min. :30.98
## A0002 : 1 1st Qu.:19.00 M:551 Basketball:182 1st Qu.:53.88
## A0003 : 1 Median :21.00 Soccer :166 Median :60.84
## A0004 : 1 Mean :20.95 Swimming :188 Mean :60.61
## A0005 : 1 3rd Qu.:23.00 Tennis :172 3rd Qu.:67.46
## A0006 : 1 Max. :24.00 Volleyball:151 Max. :92.60
## (Other):1044
## VO2_Max Speed_Index Endurance_Hours Risky_Moves
## Min. :28.80 Min. : 3.560 Min. :1.030 Min. :0.00
## 1st Qu.:41.65 1st Qu.: 5.870 1st Qu.:3.430 1st Qu.:1.00
## Median :45.02 Median : 6.500 Median :4.020 Median :2.00
## Mean :44.97 Mean : 6.519 Mean :3.992 Mean :1.99
## 3rd Qu.:48.36 3rd Qu.: 7.168 3rd Qu.:4.550 3rd Qu.:3.00
## Max. :60.61 Max. :10.160 Max. :6.270 Max. :7.00
##
## Sleep_Hours Step_Count Injury_History Training_Load
## Min. : 3.800 Min. : 8009 Uninjured:743 Min. :200.0
## 1st Qu.: 6.200 1st Qu.: 9844 Injured :307 1st Qu.:249.0
## Median : 7.000 Median :11587 Median :300.5
## Mean : 6.952 Mean :11585 Mean :300.2
## 3rd Qu.: 7.600 3rd Qu.:13446 3rd Qu.:351.8
## Max. :10.200 Max. :14989 Max. :399.0
##
## Recovery_Time Feedback_Level
## Min. : 9.90 1:237
## 1st Qu.:20.70 2:217
## Median :24.00 3:197
## Mean :24.09 4:209
## 3rd Qu.:27.50 5:190
## Max. :41.10
##
head(athletes)
## # A tibble: 6 × 15
## Athlete_ID Age Gender Sport HR_Variability VO2_Max Speed_Index
## <fct> <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 A0001 24 M Tennis 51.4 38.1 6.71
## 2 A0002 21 F Volleyball 51.2 52.2 5.47
## 3 A0003 22 M Swimming 69.6 40.5 5.61
## 4 A0004 24 M Basketball 59.5 41.2 7.88
## 5 A0005 20 M Basketball 52.2 48.2 6.14
## 6 A0006 22 F Tennis 56.3 43.6 5.95
## # ℹ 8 more variables: Endurance_Hours <dbl>, Risky_Moves <dbl>,
## # Sleep_Hours <dbl>, Step_Count <dbl>, Injury_History <fct>,
## # Training_Load <dbl>, Recovery_Time <dbl>, Feedback_Level <fct>
# Numeric correlation matrix
numeric_vars <- athletes[sapply(athletes, is.numeric)]
cor(numeric_vars, use = "pairwise.complete.obs")
## Age HR_Variability VO2_Max Speed_Index
## Age 1.0000000000 0.0002211569 -0.004611034 -0.047397237
## HR_Variability 0.0002211569 1.0000000000 -0.013905167 0.020731860
## VO2_Max -0.0046110337 -0.0139051672 1.000000000 0.016928273
## Speed_Index -0.0473972374 0.0207318598 0.016928273 1.000000000
## Endurance_Hours 0.0258082080 0.0008827948 -0.045060415 0.039579236
## Risky_Moves -0.0106255523 0.0125379165 -0.050676720 0.001850767
## Sleep_Hours -0.0161458179 -0.0014301674 0.038750392 -0.019926030
## Step_Count -0.0177096008 -0.0112507096 -0.060960339 -0.060513344
## Training_Load 0.0133558943 0.0162671246 0.041871528 -0.005530085
## Recovery_Time 0.0545575593 -0.0004471802 -0.015098212 0.006220533
## Endurance_Hours Risky_Moves Sleep_Hours Step_Count
## Age 0.0258082080 -0.010625552 -0.016145818 -0.01770960
## HR_Variability 0.0008827948 0.012537917 -0.001430167 -0.01125071
## VO2_Max -0.0450604147 -0.050676720 0.038750392 -0.06096034
## Speed_Index 0.0395792361 0.001850767 -0.019926030 -0.06051334
## Endurance_Hours 1.0000000000 -0.010231591 0.047804473 0.02134761
## Risky_Moves -0.0102315912 1.000000000 -0.083213795 -0.01379306
## Sleep_Hours 0.0478044729 -0.083213795 1.000000000 0.03837657
## Step_Count 0.0213476063 -0.013793062 0.038376572 1.00000000
## Training_Load 0.0040920025 0.022766988 0.024004899 0.04379983
## Recovery_Time 0.0416111733 0.008109124 0.020227485 0.06313255
## Training_Load Recovery_Time
## Age 0.013355894 0.0545575593
## HR_Variability 0.016267125 -0.0004471802
## VO2_Max 0.041871528 -0.0150982124
## Speed_Index -0.005530085 0.0062205327
## Endurance_Hours 0.004092003 0.0416111733
## Risky_Moves 0.022766988 0.0081091238
## Sleep_Hours 0.024004899 0.0202274851
## Step_Count 0.043799834 0.0631325528
## Training_Load 1.000000000 -0.0256919089
## Recovery_Time -0.025691909 1.0000000000
# Count of each injury status
table(athletes$Injury_History)
##
## Uninjured Injured
## 743 307
# Proportion
prop.table(table(athletes$Injury_History))
##
## Uninjured Injured
## 0.707619 0.292381
Athlete_ID
A unique identifier assigned to each athlete; used for indexing, not for analysis.
Age
Age of the athlete in years.
Gender
Self-identified gender of the athlete (M/F); used as a demographic factor.
Sport
Primary sport the athlete competes in (e.g., Basketball, Tennis).
HR_Variability
Heart rate variability (ms); a biomarker reflecting autonomic nervous system balance and recovery status.
VO2_Max
Maximum oxygen uptake; a measure of aerobic fitness and endurance capacity.
Speed_Index
Composite score representing sprinting or movement speed performance.
Endurance_Hours
Weekly endurance training hours; a measure of workload volume.
Risky_Moves
Count of high-risk movements performed (e.g., jumps, pivots); represents exposure to biomechanical load.
Sleep_Hours
Average nightly sleep duration in hours.
Step_Count
Daily step count recorded via wearable device.
Injury_History
Indicator of whether the athlete has had a prior injury (0 = no, 1 = yes).
Training_Load
Composite training stress score (higher = more physical workload).
Recovery_Time
Estimated recovery time after workouts (hours); reflects fatigue and recovery.
Feedback_Level
Athlete-provided subjective rating (1–5) of their overall training session experience or readiness.
The dataset loads cleanly with no missing values, which simplifies preprocessing.
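Injury history shows 743 uninjured vs. 307 injured athletes, a moderate class imbalance (~71% uninjured). A model that always predicts "uninjured" would already be about 71% accurate, so modeling should use class weighting or imbalance-aware metrics such as AUC rather than raw accuracy.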
Variable ranges appear plausible:
HR_Variability: 30.98 – 92.60 ms
VO2_Max: 28.80 – 60.61 (mL/kg/min)
Speed_Index: 3.56 – 10.16
Endurance_Hours: 1.03 – 6.27
Risky_Moves: 0 – 7
Sleep_Hours: 3.8 – 10.2
Step_Count: 8,009 – 14,989
Training_Load: 200 – 399
Recovery_Time: 9.9 – 41.1 hours
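Early correlations among the numeric variables are all close to zero (largest |r| ≈ 0.08, between Risky_Moves and Sleep_Hours), so no strong linear relationships are apparent. The pairs of initial interest show only negligible positive associations:
Training_Load and Step_Count: r ≈ 0.04
Sleep_Hours and Recovery_Time: r ≈ 0.02
VO2_Max and Speed_Index: r ≈ 0.02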
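One way to check which pairwise associations are actually the strongest is to rank the off-diagonal entries of the correlation matrix by absolute value and test the top pair. The sketch below uses simulated, independent columns that borrow four of the dataset's names; the table name `pair_tab` is hypothetical.

```r
# Simulated, mutually independent columns (NOT the real data).
set.seed(7)
sim <- data.frame(
  Training_Load = rnorm(200),
  Step_Count    = rnorm(200),
  Sleep_Hours   = rnorm(200),
  Recovery_Time = rnorm(200)
)

r <- cor(sim)

# Keep each pair once (upper triangle), then sort by absolute correlation.
idx <- which(upper.tri(r), arr.ind = TRUE)
pair_tab <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx],
  stringsAsFactors = FALSE
)
pair_tab <- pair_tab[order(abs(pair_tab$r), decreasing = TRUE), ]
print(pair_tab)

# Significance test and confidence interval for the strongest pair.
top <- pair_tab[1, ]
ct  <- cor.test(sim[[as.character(top$var1)]], sim[[as.character(top$var2)]])
print(ct$conf.int)
```

Reporting the coefficient and its confidence interval alongside the direction makes it clear whether a "positive association" is distinguishable from zero.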
These preliminary findings motivate more detailed exploration, including variable selection for predictive modeling and visualizations in the Shiny dashboard.