1. Dataset Description

Public Dataset

Kaggle — College Athlete Training and Performance Dataset https://www.kaggle.com/datasets/pythondeveloper007/college-athlete-training-and-performance-data

Context & Purpose

This dataset includes biometric, workload, sleep, and injury-related metrics collected from college athletes across multiple sports. While not tied to a peer-reviewed publication, it mirrors datasets commonly used in sports-medicine research with wearables and athlete monitoring systems. Such datasets allow investigation of injury risk, training load, recovery, and performance optimization.

Data Structure

The dataset contains 1,050 anonymized observations and 15 variables, grouped as follows:

Demographics: Age, Gender, Sport

Biometrics: HR_Variability, VO2_Max, Speed_Index

Training/Workload: Endurance_Hours, Risky_Moves, Step_Count, Training_Load

Recovery: Sleep_Hours, Recovery_Time

History & Feedback: Injury_History (0/1), Feedback_Level

Identifier: Athlete_ID (non-analytic)

This combination of physiological, performance, and workload variables makes the dataset suitable for modeling injury risk and understanding patterns of fatigue or performance deficits.

Processing Status

The dataset appears moderately preprocessed:

No missing values in initial inspection

Consistent variable naming

Numeric and categorical variables stored appropriately

It is not raw physiological data, but a cleaned summary dataset ready for exploratory analysis and modeling.

2. Research Question & Project Aim

Primary Aim

To identify which biometric, workload, sleep, and performance metrics are most strongly associated with prior injury among college athletes, and evaluate whether multivariable methods can detect high-risk patterns.

Secondary Aim

To develop an interactive Shiny dashboard that visualizes:

Injury prevalence across sports

Relationships between workload and recovery

Biometric-performance correlations

Athlete-level summary patterns

This extends the dataset beyond the summary provided on Kaggle and aligns with the course focus on wrangling, exploration, and presentation.

3. Approach

Because this dataset is not associated with a publication, the analysis will focus on producing new exploratory insights, including:

3.1 Multivariable Exploration

I will examine associations between Injury_History and the following variables:

Age, Gender, Sport, Feedback_Level, HR_Variability, VO2_Max, Speed_Index, Endurance_Hours, Risky_Moves, Step_Count, Training_Load, Sleep_Hours, Recovery_Time

I will also explore non-linear or interaction patterns (e.g., high training load but low sleep) to identify risk combinations not captured by univariate analysis.
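One way to probe a combination such as high training load with low sleep is to compare a main-effects logistic model against one that adds an interaction term. The sketch below uses simulated data, not the Kaggle file; the column names follow the proposal, while the thresholds (320 and 6.5) and injury probabilities are made-up assumptions chosen only so the example runs on its own.

```r
# Synthetic stand-in for the athlete data (NOT the real dataset).
set.seed(42)
n <- 500
sim <- data.frame(
  Training_Load = runif(n, 200, 400),
  Sleep_Hours   = rnorm(n, mean = 7, sd = 1)
)

# Hypothetical risk pattern: injury odds rise when load is high AND sleep is low.
high_load_low_sleep <- sim$Training_Load > 320 & sim$Sleep_Hours < 6.5
sim$Injury_History <- rbinom(n, 1, ifelse(high_load_low_sleep, 0.6, 0.2))

# Centre predictors so main effects are read at average load and sleep.
sim$load_c  <- as.numeric(scale(sim$Training_Load, scale = FALSE))
sim$sleep_c <- as.numeric(scale(sim$Sleep_Hours, scale = FALSE))

# Main-effects model versus a model that adds the load-by-sleep interaction.
fit_main <- glm(Injury_History ~ load_c + sleep_c, data = sim, family = binomial)
fit_int  <- glm(Injury_History ~ load_c * sleep_c, data = sim, family = binomial)

# Likelihood-ratio test: does the interaction term improve the fit?
lrt <- anova(fit_main, fit_int, test = "Chisq")
print(lrt)
```

On the real data the same comparison would use the `athletes` tibble, with Injury_History kept as a 0/1 value or a two-level factor.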

3.2 Dimensionality Reduction

I will run PCA on the scaled biometric and workload variables to identify latent components that may represent fatigue, recovery, or performance profiles, and check whether athletes cluster along them.
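A minimal PCA sketch with base R's prcomp() is shown here on simulated values. The column names match the proposal, but the means, spreads, and sample size are placeholders, so the loadings carry no real meaning.

```r
# Synthetic stand-in for six biometric/workload columns (NOT the real data).
set.seed(1)
n <- 300
sim <- data.frame(
  HR_Variability  = rnorm(n, 60, 10),
  VO2_Max         = rnorm(n, 45, 5),
  Speed_Index     = rnorm(n, 6.5, 1),
  Endurance_Hours = rnorm(n, 4, 0.8),
  Training_Load   = runif(n, 200, 400),
  Step_Count      = runif(n, 8000, 15000)
)

# Centre and scale first: the variables are on very different scales
# (steps in the thousands, speed index in single digits).
pca <- prcomp(sim, center = TRUE, scale. = TRUE)

# Share of total variance carried by each component, and the running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(var_explained), 3))

# Loadings of the first two components, to see which variables drive them.
print(round(pca$rotation[, 1:2], 2))
```

If the real variables are as weakly correlated as the preliminary matrix suggests, each component would explain roughly an equal share of variance, which would itself be a useful finding.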

3.3 Visualization-Focused Analysis

Designed to support the upcoming Shiny app:

Sport-specific injury rates

Workload–recovery imbalance plots

Performance correlations (e.g., VO2 Max vs Speed Index)

3.4 Exploratory Prediction Models (Not for clinical use)

Logistic regression for injury association

Random forest for variable importance

ROC/AUC evaluation

This approach provides novel analysis not contained in the original data description and fits the assignment’s applied focus.
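The exploratory modeling step in 3.4 could look roughly like the sketch below: fit a logistic regression on a training split and compute AUC on held-out rows. The data are simulated, the effect sizes are invented, and `auc_rank` is a hypothetical helper that computes AUC from the rank-sum (Mann-Whitney) identity so that no extra package is needed.

```r
# Synthetic stand-in data (NOT the real dataset); effects are made up.
set.seed(7)
n <- 600
sim <- data.frame(
  Sleep_Hours = rnorm(n, 7, 1),
  Risky_Moves = rpois(n, 2)
)
lin <- -1 - 0.5 * (sim$Sleep_Hours - 7) + 0.3 * (sim$Risky_Moves - 2)
sim$Injury_History <- rbinom(n, 1, plogis(lin))

# 70/30 train/test split.
train_idx <- sample(n, size = round(0.7 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

fit <- glm(Injury_History ~ Sleep_Hours + Risky_Moves,
           data = train, family = binomial)
p <- predict(fit, newdata = test, type = "response")

# AUC via the rank-sum identity; rank() assigns midranks to ties.
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc <- auc_rank(p, test$Injury_History)
print(auc)
```

An AUC near 0.5 on the real data would indicate that the predictors carry little information about injury history.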

4. Proposed Workflow

4.1 Data Import & Wrangling

Load CSV into R using readr

Convert categorical variables to factors: Gender, Sport, Injury_History, Feedback_Level

Confirm numeric variables are stored as numeric: HR_Variability, VO2_Max, Training_Load, etc.

Inspect distributions and outliers

Prepare scaled/normalized values for PCA or modeling
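The wrangling steps above might be condensed into a single dplyr pipeline, sketched below. The four rows are copied from the first records shown in the str() output so the example runs without the CSV. Treating Feedback_Level as an ordered factor and the `_z` suffix for scaled columns are assumptions made for illustration, not choices stated in the proposal.

```r
library(dplyr)

# First four records from the str() output, limited to a few columns.
raw <- data.frame(
  Athlete_ID     = c("A0001", "A0002", "A0003", "A0004"),
  Gender         = c("M", "F", "M", "M"),
  Sport          = c("Tennis", "Volleyball", "Swimming", "Basketball"),
  Injury_History = c(0, 1, 0, 0),
  Feedback_Level = c(5, 4, 1, 3),
  Training_Load  = c(392, 361, 250, 202),
  Sleep_Hours    = c(7.8, 6.1, 6.4, 7.9),
  stringsAsFactors = FALSE
)

clean <- raw %>%
  mutate(
    across(c(Gender, Sport), as.factor),
    # Explicit levels guarantee that 0 maps to "Uninjured" and 1 to "Injured".
    Injury_History = factor(Injury_History, levels = c(0, 1),
                            labels = c("Uninjured", "Injured")),
    # Assumption: the 1-5 rating is ordinal, so keep the ordering.
    Feedback_Level = factor(Feedback_Level, levels = 1:5, ordered = TRUE),
    # z-scores for PCA or modeling; as.numeric() drops scale()'s matrix attributes.
    across(c(Training_Load, Sleep_Hours),
           ~ as.numeric(scale(.x)), .names = "{.col}_z")
  )

str(clean)
```

Note that an ordered factor enters glm() with polynomial contrasts by default, so an unordered factor may be easier to interpret in the regression step.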

4.2 Exploratory Visualization

Density and boxplots for workload and biometric variables (ggplot2, ggpubr)

Correlation matrices via GGally and corrplot

Injury vs non-injury comparisons

Sport-specific demographic and workload summaries

4.3 Statistical & Predictive Methods

Logistic regression using glm() for injury prediction

Random forest with randomForest for variable importance

PCA for dimensionality reduction and cluster detection

Optional t-SNE or UMAP for nonlinear latent structures

4.4 Shiny App Outline

UI Components:

Select sport, gender, variable type, etc.

Dynamic plots (ggplot-driven)

Summary statistics panel

Visual Modules:

Injury prevalence bar chart by sport

Workload vs recovery scatterplots

HR_Variability / VO2_Max / Speed_Index comparisons

PCA biplot viewer
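The outline above could start from a skeleton like the following: one sport selector, one workload-versus-recovery scatterplot, and a small summary table. The data are simulated, and the input/output IDs (`sport`, `load_recovery`, `injury_rate`) and the app title are hypothetical names, not part of the proposal.

```r
library(shiny)
library(ggplot2)

# Synthetic stand-in data (NOT the real dataset).
set.seed(3)
sim <- data.frame(
  Sport          = sample(c("Tennis", "Soccer", "Swimming"), 200, replace = TRUE),
  Training_Load  = runif(200, 200, 400),
  Recovery_Time  = rnorm(200, 24, 5),
  Injury_History = sample(c("Uninjured", "Injured"), 200,
                          replace = TRUE, prob = c(0.7, 0.3)),
  stringsAsFactors = FALSE
)

ui <- fluidPage(
  titlePanel("Athlete workload explorer (sketch)"),
  sidebarLayout(
    sidebarPanel(
      selectInput("sport", "Sport", choices = c("All", sort(unique(sim$Sport))))
    ),
    mainPanel(
      plotOutput("load_recovery"),
      tableOutput("injury_rate")
    )
  )
)

server <- function(input, output, session) {
  # Reactive subset shared by the plot and the summary table.
  filtered <- reactive({
    if (input$sport == "All") sim else sim[sim$Sport == input$sport, ]
  })

  output$load_recovery <- renderPlot({
    ggplot(filtered(),
           aes(Training_Load, Recovery_Time, colour = Injury_History)) +
      geom_point(alpha = 0.6) +
      labs(x = "Training load", y = "Recovery time (hours)",
           colour = "Injury history")
  })

  output$injury_rate <- renderTable({
    d <- filtered()
    data.frame(Athletes = nrow(d),
               Injured_share = mean(d$Injury_History == "Injured"))
  })
}

app <- shinyApp(ui, server)
# Launch interactively with: runApp(app)
```

Keeping the filter in a single reactive means further modules, such as a PCA biplot, can reuse the same subset.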

5. Workflow-Specific Packages

Data handling: readr, dplyr, formattable, broom, car

Visualization: ggplot2, ggpubr, GGally, cowplot, corrplot, scales, DT

Machine learning & statistics: caret, randomForest, mclust, Rtsne, umap

Shiny app: shiny, rsconnect

These packages will allow complete exploration, modeling, visualization, and interactive dashboard creation.

6. Preliminary Look at the Data

Below is the required initial inspection (without printing the full dataset).

# Data handling
library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(formattable)
library(broom)
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

# Visualization
library(ggplot2)
library(ggpubr)
library(GGally)
library(cowplot)

## 
## Attaching package: 'cowplot'

## The following object is masked from 'package:ggpubr':
## 
##     get_legend

library(corrplot)

## corrplot 0.95 loaded

library(scales)

## 
## Attaching package: 'scales'

## The following objects are masked from 'package:formattable':
## 
##     comma, percent, scientific

## The following object is masked from 'package:readr':
## 
##     col_factor

library(DT)

# Machine learning & statistics
library(caret)

## Loading required package: lattice

library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

library(mclust)

## Package 'mclust' version 6.1.2
## Type 'citation("mclust")' for citing this R package in publications.

## 
## Attaching package: 'mclust'

## The following object is masked from 'package:dplyr':
## 
##     count

library(Rtsne)
library(umap)

# Shiny app
library(shiny)

## 
## Attaching package: 'shiny'

## The following object is masked from 'package:mclust':
## 
##     em

## The following objects are masked from 'package:DT':
## 
##     dataTableOutput, renderDataTable

library(rsconnect)

## 
## Attaching package: 'rsconnect'

## The following object is masked from 'package:shiny':
## 
##     serverInfo

# Read dataset
athletes <- read_csv("College_Sports_Dataset.csv")

## Rows: 1050 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): Athlete_ID, Gender, Sport
## dbl (12): Age, HR_Variability, VO2_Max, Speed_Index, Endurance_Hours, Risky_...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Missing-value check, then structure & dimensions
summary(is.na(athletes))

##  Athlete_ID         Age            Gender          Sport        
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:1050      FALSE:1050      FALSE:1050      FALSE:1050     
##  HR_Variability   VO2_Max        Speed_Index     Endurance_Hours
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:1050      FALSE:1050      FALSE:1050      FALSE:1050     
##  Risky_Moves     Sleep_Hours     Step_Count      Injury_History 
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:1050      FALSE:1050      FALSE:1050      FALSE:1050     
##  Training_Load   Recovery_Time   Feedback_Level 
##  Mode :logical   Mode :logical   Mode :logical  
##  FALSE:1050      FALSE:1050      FALSE:1050

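# Count exact zeros per column; sapply() works column by column and avoids
# apply()'s coercion of the whole tibble to a character matrix
sapply(athletes, function(x) sum(x == 0))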

##      Athlete_ID             Age          Gender           Sport  HR_Variability 
##               0               0               0               0               0 
##         VO2_Max     Speed_Index Endurance_Hours     Risky_Moves     Sleep_Hours 
##               0               0               0             156               0 
##      Step_Count  Injury_History   Training_Load   Recovery_Time  Feedback_Level 
##               0             743               0               0               0

str(athletes)

## spc_tbl_ [1,050 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Athlete_ID     : chr [1:1050] "A0001" "A0002" "A0003" "A0004" ...
##  $ Age            : num [1:1050] 24 21 22 24 20 22 22 24 19 20 ...
##  $ Gender         : chr [1:1050] "M" "F" "M" "M" ...
##  $ Sport          : chr [1:1050] "Tennis" "Volleyball" "Swimming" "Basketball" ...
##  $ HR_Variability : num [1:1050] 51.4 51.2 69.6 59.5 52.2 ...
##  $ VO2_Max        : num [1:1050] 38.1 52.2 40.5 41.2 48.2 ...
##  $ Speed_Index    : num [1:1050] 6.71 5.47 5.61 7.88 6.14 5.95 5.16 7.35 5.44 6.48 ...
##  $ Endurance_Hours: num [1:1050] 4.18 3.19 4.77 4.29 3.09 2.72 3.22 4.67 3.76 4.06 ...
##  $ Risky_Moves    : num [1:1050] 1 3 1 4 2 3 2 2 1 1 ...
##  $ Sleep_Hours    : num [1:1050] 7.8 6.1 6.4 7.9 6.3 6.8 6 8.8 7.5 6.7 ...
##  $ Step_Count     : num [1:1050] 14229 9239 14566 11391 12213 ...
##  $ Injury_History : num [1:1050] 0 1 0 0 1 1 0 0 0 0 ...
##  $ Training_Load  : num [1:1050] 392 361 250 202 327 318 257 324 399 312 ...
##  $ Recovery_Time  : num [1:1050] 30.1 21 24.3 25.9 16.6 22.6 30.6 29.4 22.5 27.6 ...
##  $ Feedback_Level : num [1:1050] 5 4 1 3 2 4 4 5 4 5 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Athlete_ID = col_character(),
##   ..   Age = col_double(),
##   ..   Gender = col_character(),
##   ..   Sport = col_character(),
##   ..   HR_Variability = col_double(),
##   ..   VO2_Max = col_double(),
##   ..   Speed_Index = col_double(),
##   ..   Endurance_Hours = col_double(),
##   ..   Risky_Moves = col_double(),
##   ..   Sleep_Hours = col_double(),
##   ..   Step_Count = col_double(),
##   ..   Injury_History = col_double(),
##   ..   Training_Load = col_double(),
##   ..   Recovery_Time = col_double(),
##   ..   Feedback_Level = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

dim(athletes)

## [1] 1050   15

# Factor variables

athletes$Athlete_ID     <- as.factor(athletes$Athlete_ID)
athletes$Gender         <- as.factor(athletes$Gender)
athletes$Sport          <- as.factor(athletes$Sport)
athletes$Injury_History <- as.factor(athletes$Injury_History)
athletes$Feedback_Level <- as.factor(athletes$Feedback_Level)

# Numeric variables

athletes$Age             <- as.numeric(athletes$Age)
athletes$HR_Variability  <- as.numeric(athletes$HR_Variability)
athletes$VO2_Max         <- as.numeric(athletes$VO2_Max)
athletes$Speed_Index     <- as.numeric(athletes$Speed_Index)
athletes$Endurance_Hours <- as.numeric(athletes$Endurance_Hours)
athletes$Risky_Moves     <- as.numeric(athletes$Risky_Moves)
athletes$Sleep_Hours     <- as.numeric(athletes$Sleep_Hours)
athletes$Step_Count      <- as.numeric(athletes$Step_Count)
athletes$Training_Load   <- as.numeric(athletes$Training_Load)
athletes$Recovery_Time   <- as.numeric(athletes$Recovery_Time)

# Convert Injury_History labels to text
levels(athletes$Injury_History) <- c("Uninjured", "Injured")

# Check changes
str(athletes)

## spc_tbl_ [1,050 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Athlete_ID     : Factor w/ 1050 levels "A0001","A0002",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Age            : num [1:1050] 24 21 22 24 20 22 22 24 19 20 ...
##  $ Gender         : Factor w/ 2 levels "F","M": 2 1 2 2 2 1 1 1 1 2 ...
##  $ Sport          : Factor w/ 6 levels "Athletics","Basketball",..: 5 6 4 2 2 5 4 1 4 1 ...
##  $ HR_Variability : num [1:1050] 51.4 51.2 69.6 59.5 52.2 ...
##  $ VO2_Max        : num [1:1050] 38.1 52.2 40.5 41.2 48.2 ...
##  $ Speed_Index    : num [1:1050] 6.71 5.47 5.61 7.88 6.14 5.95 5.16 7.35 5.44 6.48 ...
##  $ Endurance_Hours: num [1:1050] 4.18 3.19 4.77 4.29 3.09 2.72 3.22 4.67 3.76 4.06 ...
##  $ Risky_Moves    : num [1:1050] 1 3 1 4 2 3 2 2 1 1 ...
##  $ Sleep_Hours    : num [1:1050] 7.8 6.1 6.4 7.9 6.3 6.8 6 8.8 7.5 6.7 ...
##  $ Step_Count     : num [1:1050] 14229 9239 14566 11391 12213 ...
##  $ Injury_History : Factor w/ 2 levels "Uninjured","Injured": 1 2 1 1 2 2 1 1 1 1 ...
##  $ Training_Load  : num [1:1050] 392 361 250 202 327 318 257 324 399 312 ...
##  $ Recovery_Time  : num [1:1050] 30.1 21 24.3 25.9 16.6 22.6 30.6 29.4 22.5 27.6 ...
##  $ Feedback_Level : Factor w/ 5 levels "1","2","3","4",..: 5 4 1 3 2 4 4 5 4 5 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Athlete_ID = col_character(),
##   ..   Age = col_double(),
##   ..   Gender = col_character(),
##   ..   Sport = col_character(),
##   ..   HR_Variability = col_double(),
##   ..   VO2_Max = col_double(),
##   ..   Speed_Index = col_double(),
##   ..   Endurance_Hours = col_double(),
##   ..   Risky_Moves = col_double(),
##   ..   Sleep_Hours = col_double(),
##   ..   Step_Count = col_double(),
##   ..   Injury_History = col_double(),
##   ..   Training_Load = col_double(),
##   ..   Recovery_Time = col_double(),
##   ..   Feedback_Level = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

# Quick summaries

summary(athletes)

##    Athlete_ID        Age        Gender         Sport     HR_Variability 
##  A0001  :   1   Min.   :18.00   F:499   Athletics :191   Min.   :30.98  
##  A0002  :   1   1st Qu.:19.00   M:551   Basketball:182   1st Qu.:53.88  
##  A0003  :   1   Median :21.00           Soccer    :166   Median :60.84  
##  A0004  :   1   Mean   :20.95           Swimming  :188   Mean   :60.61  
##  A0005  :   1   3rd Qu.:23.00           Tennis    :172   3rd Qu.:67.46  
##  A0006  :   1   Max.   :24.00           Volleyball:151   Max.   :92.60  
##  (Other):1044                                                           
##     VO2_Max       Speed_Index     Endurance_Hours  Risky_Moves  
##  Min.   :28.80   Min.   : 3.560   Min.   :1.030   Min.   :0.00  
##  1st Qu.:41.65   1st Qu.: 5.870   1st Qu.:3.430   1st Qu.:1.00  
##  Median :45.02   Median : 6.500   Median :4.020   Median :2.00  
##  Mean   :44.97   Mean   : 6.519   Mean   :3.992   Mean   :1.99  
##  3rd Qu.:48.36   3rd Qu.: 7.168   3rd Qu.:4.550   3rd Qu.:3.00  
##  Max.   :60.61   Max.   :10.160   Max.   :6.270   Max.   :7.00  
##                                                                 
##   Sleep_Hours       Step_Count      Injury_History Training_Load  
##  Min.   : 3.800   Min.   : 8009   Uninjured:743    Min.   :200.0  
##  1st Qu.: 6.200   1st Qu.: 9844   Injured  :307    1st Qu.:249.0  
##  Median : 7.000   Median :11587                    Median :300.5  
##  Mean   : 6.952   Mean   :11585                    Mean   :300.2  
##  3rd Qu.: 7.600   3rd Qu.:13446                    3rd Qu.:351.8  
##  Max.   :10.200   Max.   :14989                    Max.   :399.0  
##                                                                   
##  Recovery_Time   Feedback_Level
##  Min.   : 9.90   1:237         
##  1st Qu.:20.70   2:217         
##  Median :24.00   3:197         
##  Mean   :24.09   4:209         
##  3rd Qu.:27.50   5:190         
##  Max.   :41.10                 
##

head(athletes)

## # A tibble: 6 × 15
##   Athlete_ID   Age Gender Sport      HR_Variability VO2_Max Speed_Index
##   <fct>      <dbl> <fct>  <fct>               <dbl>   <dbl>       <dbl>
## 1 A0001         24 M      Tennis               51.4    38.1        6.71
## 2 A0002         21 F      Volleyball           51.2    52.2        5.47
## 3 A0003         22 M      Swimming             69.6    40.5        5.61
## 4 A0004         24 M      Basketball           59.5    41.2        7.88
## 5 A0005         20 M      Basketball           52.2    48.2        6.14
## 6 A0006         22 F      Tennis               56.3    43.6        5.95
## # ℹ 8 more variables: Endurance_Hours <dbl>, Risky_Moves <dbl>,
## #   Sleep_Hours <dbl>, Step_Count <dbl>, Injury_History <fct>,
## #   Training_Load <dbl>, Recovery_Time <dbl>, Feedback_Level <fct>

# Numeric correlation matrix
numeric_vars <- athletes[sapply(athletes, is.numeric)]
cor(numeric_vars, use = "pairwise.complete.obs")

##                           Age HR_Variability      VO2_Max  Speed_Index
## Age              1.0000000000   0.0002211569 -0.004611034 -0.047397237
## HR_Variability   0.0002211569   1.0000000000 -0.013905167  0.020731860
## VO2_Max         -0.0046110337  -0.0139051672  1.000000000  0.016928273
## Speed_Index     -0.0473972374   0.0207318598  0.016928273  1.000000000
## Endurance_Hours  0.0258082080   0.0008827948 -0.045060415  0.039579236
## Risky_Moves     -0.0106255523   0.0125379165 -0.050676720  0.001850767
## Sleep_Hours     -0.0161458179  -0.0014301674  0.038750392 -0.019926030
## Step_Count      -0.0177096008  -0.0112507096 -0.060960339 -0.060513344
## Training_Load    0.0133558943   0.0162671246  0.041871528 -0.005530085
## Recovery_Time    0.0545575593  -0.0004471802 -0.015098212  0.006220533
##                 Endurance_Hours  Risky_Moves  Sleep_Hours  Step_Count
## Age                0.0258082080 -0.010625552 -0.016145818 -0.01770960
## HR_Variability     0.0008827948  0.012537917 -0.001430167 -0.01125071
## VO2_Max           -0.0450604147 -0.050676720  0.038750392 -0.06096034
## Speed_Index        0.0395792361  0.001850767 -0.019926030 -0.06051334
## Endurance_Hours    1.0000000000 -0.010231591  0.047804473  0.02134761
## Risky_Moves       -0.0102315912  1.000000000 -0.083213795 -0.01379306
## Sleep_Hours        0.0478044729 -0.083213795  1.000000000  0.03837657
## Step_Count         0.0213476063 -0.013793062  0.038376572  1.00000000
## Training_Load      0.0040920025  0.022766988  0.024004899  0.04379983
## Recovery_Time      0.0416111733  0.008109124  0.020227485  0.06313255
##                 Training_Load Recovery_Time
## Age               0.013355894  0.0545575593
## HR_Variability    0.016267125 -0.0004471802
## VO2_Max           0.041871528 -0.0150982124
## Speed_Index      -0.005530085  0.0062205327
## Endurance_Hours   0.004092003  0.0416111733
## Risky_Moves       0.022766988  0.0081091238
## Sleep_Hours       0.024004899  0.0202274851
## Step_Count        0.043799834  0.0631325528
## Training_Load     1.000000000 -0.0256919089
## Recovery_Time    -0.025691909  1.0000000000

# Count of each injury status
table(athletes$Injury_History)

## 
## Uninjured   Injured 
##       743       307

# Proportion
prop.table(table(athletes$Injury_History))

## 
## Uninjured   Injured 
##  0.707619  0.292381

Variable Descriptions

Athlete_ID

A unique identifier assigned to each athlete; used for indexing, not for analysis.

Age

Age of the athlete in years.

Gender

Self-identified gender of the athlete (M/F); used as a demographic factor.

Sport

Primary sport the athlete competes in (e.g., Basketball, Tennis).

HR_Variability

Heart rate variability (ms); a biomarker reflecting autonomic nervous system balance and recovery status.

VO2_Max

Maximum oxygen uptake; a measure of aerobic fitness and endurance capacity.

Speed_Index

Composite score representing sprinting or movement speed performance.

Endurance_Hours

Weekly endurance training hours; a measure of workload volume.

Risky_Moves

Count of high-risk movements performed (e.g., jumps, pivots); represents exposure to biomechanical load.

Sleep_Hours

Average nightly sleep duration in hours.

Step_Count

Daily step count recorded via wearable device.

Injury_History

Indicator of whether the athlete has had a prior injury (0 = no, 1 = yes).

Training_Load

Composite training stress score (higher = more physical workload).

Recovery_Time

Estimated recovery time after workouts (hours); reflects fatigue and recovery.

Feedback_Level

Athlete-provided subjective rating (1–5) of their overall training session experience or readiness.

7. Initial Observations

The dataset loads cleanly with no missing values, which simplifies preprocessing.

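Injury history shows 743 uninjured vs. 307 injured athletes, a moderate class imbalance (70.8% vs. 29.2%). This calls for class weighting or careful interpretation of model performance, since a model that always predicts "uninjured" would already be about 71% accurate.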
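To make the imbalance concrete, the sketch below computes the majority-class baseline and a set of inverse-frequency class weights from the counts reported by table(). The weighting formula n / (k * n_class) is one common convention, used here as an assumption rather than a method stated in the proposal.

```r
# Counts taken from table(athletes$Injury_History).
counts <- c(Uninjured = 743, Injured = 307)

# Accuracy of a model that always predicts the majority class.
baseline_acc <- max(counts) / sum(counts)
print(round(baseline_acc, 3))

# Inverse-frequency ("balanced") weights: n / (k * n_class).
class_w <- sum(counts) / (length(counts) * counts)
print(round(class_w, 3))

# Check: after weighting, each class contributes the same total weight.
print(class_w * counts)
```

With these weights each class contributes a total weight of 525, so neither class dominates a weighted fit.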

Variable distributions appear reasonable:

HR_Variability: 30.98 – 92.60 ms

VO2_Max: 28.80 – 60.61 (mL/kg/min)

Speed_Index: 3.56 – 10.16

Endurance_Hours: 1.03 – 6.27

Risky_Moves: 0 – 7

Sleep_Hours: 3.8 – 10.2

Step_Count: 8,009 – 14,989

Training_Load: 200 – 399

Recovery_Time: 9.9 – 41.1 hours

Pairwise correlations among the numeric variables are all very weak (|r| < 0.09), so none indicates a meaningful linear relationship:

Training_Load vs. Step_Count: r ≈ 0.04

Sleep_Hours vs. Recovery_Time: r ≈ 0.02

VO2_Max vs. Speed_Index: r ≈ 0.02

The largest in magnitude is Risky_Moves vs. Sleep_Hours (r ≈ -0.08).

Because linear associations are this weak, further exploration will emphasize the non-linear and interaction patterns described in Section 3.1, alongside variable selection for predictive modeling and the visualizations in the Shiny dashboard.
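Reading a 10 x 10 correlation matrix by eye is error-prone, so one option is to rank the off-diagonal pairs by absolute value. The sketch below does this for four of the variables, with coefficients rounded from the matrix printed in Section 6; `rank_pairs` is a hypothetical helper name.

```r
# Return every off-diagonal pair once, strongest |r| first.
rank_pairs <- function(cm) {
  idx <- which(upper.tri(cm), arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]
}

# Subset of the printed correlation matrix (values rounded to 4 decimals).
vars <- c("Risky_Moves", "Sleep_Hours", "Step_Count", "Recovery_Time")
cm <- diag(4)
dimnames(cm) <- list(vars, vars)
cm["Risky_Moves", "Sleep_Hours"]   <- -0.0832
cm["Risky_Moves", "Step_Count"]    <- -0.0138
cm["Risky_Moves", "Recovery_Time"] <-  0.0081
cm["Sleep_Hours", "Step_Count"]    <-  0.0384
cm["Sleep_Hours", "Recovery_Time"] <-  0.0202
cm["Step_Count", "Recovery_Time"]  <-  0.0631

# Mirror the upper triangle so the matrix is symmetric.
cm[lower.tri(cm)] <- t(cm)[lower.tri(cm)]

top <- rank_pairs(cm)
print(top)
```

On the full data the same helper could be applied directly to cor(numeric_vars).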

PSL4040 - Proposal Document

Michael Sava

2025-11-18