Injury Prediction

knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  comment = "#>"
)

Introduction

To concern player safety and injury prevention in competitive sports, the dataset designed specifically for injury prediction. The Dataset representation of player health and injury data. Our synthetic dataset captures critical attributes such as player demographics, training intensities, recovery times, and previous injury histories. We establish correlations between these features and the likelihood of future injuries to accurately simulate real-world scenarios.

In this project of Programming for Data Science with R, we would like to make a predict model which performing predictions for players safety and injury prevention in competitive sports based on the categories of several supporting variables. The algorithms I will use are logistic regression and k-nearest neighbors. This data collected from kaggle.

Data Preparation

Import Library

library(dplyr)
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)
library(ggcorrplot)

Load Dataset

injury <- read.csv("datainput/injury_data.csv")
head(injury)
#>   Player_Age Player_Weight Player_Height Previous_Injuries Training_Intensity
#> 1         24      66.25193      175.7324                 1          0.4579290
#> 2         37      70.99627      174.5817                 0          0.2265216
#> 3         32      80.09378      186.3296                 0          0.6139703
#> 4         28      87.47327      175.5042                 1          0.2528581
#> 5         25      84.65922      190.1750                 0          0.5776318
#> 6         38      75.82055      206.6318                 1          0.3592087
#>   Recovery_Time Likelihood_of_Injury
#> 1             5                    0
#> 2             6                    1
#> 3             2                    1
#> 4             4                    1
#> 5             1                    1
#> 6             4                    0

Column Description

Predictor Variabels

  • Player_Age : Age of the player in years.
  • Player_Weight : Weight of the player in kilograms, following a normal distribution with mean 75 and standard deviation 10.
  • Player_Height : Height of the player in centimeters, following a normal distribution with mean 180 and standard deviation 10.
  • Previous_Injuries : Binary indicator (0 or 1) representing whether the player has had previous injuries (1) or not (0).
  • Training_Intensity : A value between 0 and 1 representing the intensity of the player’s training regimen.
  • Recovery_Time : Number of days required for the player to recover from an injury, ranging from 1 to 6 days.

Target Variabel

  • Likelihood_of_Injury : Binary indicator (0 or 1) representing the likelihood of the player experiencing an injury (1) or not (0).

Data Processing

Check General Data Information

glimpse(injury)
#> Rows: 1,000
#> Columns: 7
#> $ Player_Age           <int> 24, 37, 32, 28, 25, 38, 24, 36, 28, 28, 38, 21, 2…
#> $ Player_Weight        <dbl> 66.25193, 70.99627, 80.09378, 87.47327, 84.65922,…
#> $ Player_Height        <dbl> 175.7324, 174.5817, 186.3296, 175.5042, 190.1750,…
#> $ Previous_Injuries    <int> 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0…
#> $ Training_Intensity   <dbl> 0.45792899, 0.22652163, 0.61397031, 0.25285812, 0…
#> $ Recovery_Time        <int> 5, 6, 2, 4, 1, 4, 2, 3, 1, 1, 3, 4, 5, 4, 2, 6, 5…
#> $ Likelihood_of_Injury <int> 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1…

From the dataset above, the data has 7 columns, 1000 rows and the data types for each column. Checking the data types is a crucial step due to the data types must be appropriate for analysis.

Missing Value

colSums(is.na(injury))
#>           Player_Age        Player_Weight        Player_Height 
#>                    0                    0                    0 
#>    Previous_Injuries   Training_Intensity        Recovery_Time 
#>                    0                    0                    0 
#> Likelihood_of_Injury 
#>                    0

In the dataset above, has no missing value data in any columns.

Duplicates

sum(duplicated(injury))
#> [1] 0

In the dataset above, has no duplicates data in any columns.

Checking Target Variabels Class

prop.table(table(injury$Likelihood_of_Injury))
#> 
#>   0   1 
#> 0.5 0.5
table(injury$Likelihood_of_Injury)
#> 
#>   0   1 
#> 500 500

Checking the proportion of the target variable class “Likelihood_of_Injury,” it was found that the class proportions are 50% for “0” (not-injury) and 50% for class “1” (injury).

Data Cleaning

injury_clean <- injury %>% 
  mutate(Likelihood_of_Injury  = as.factor(Likelihood_of_Injury))# change data type chr to factor

Modelling

Logistic Regression

Logistic regression is a statistical model used for binary classification tasks, where the outcome variable (dependent variable) is categorical and has two possible outcomes, typically labeled as 0 and 1. It is an extension of linear regression, but instead of predicting the value of a continuous variable, logistic regression predicts the probability that an observation belongs to a certain category.

Train_Test Split Data

Before we make the predict model, we should to split the data into train dataset and test dataset. We will use the train dataset to train the linear regression model. The test dataset will be used as a comparasion and see if the model get overfit and can not predict new data that hasn’t been seen during training phase. We will 80% of the data as the training data and the rest of it as the testing data.

set.seed(221)
index <- sample(nrow(injury_clean), nrow(injury_clean)*0.8)
injury_train <- injury_clean[index, ]
injury_test <- injury_clean[-index, ]
injury_clean$Likelihood_of_Injury %>% 
  levels()
#> [1] "0" "1"
model_lgr <- glm(formula = Likelihood_of_Injury ~., family = "binomial", 
             data = injury_train)
summary(model_lgr)
#> 
#> Call:
#> glm(formula = Likelihood_of_Injury ~ ., family = "binomial", 
#>     data = injury_train)
#> 
#> Coefficients:
#>                     Estimate Std. Error z value Pr(>|z|)  
#> (Intercept)        -1.375709   1.427224  -0.964   0.3351  
#> Player_Age          0.001195   0.010862   0.110   0.9124  
#> Player_Weight       0.000704   0.007292   0.097   0.9231  
#> Player_Height       0.006295   0.007324   0.860   0.3900  
#> Previous_Injuries   0.114073   0.142648   0.800   0.4239  
#> Training_Intensity  0.571054   0.249108   2.292   0.0219 *
#> Recovery_Time      -0.046091   0.042203  -1.092   0.2748  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1108.9  on 799  degrees of freedom
#> Residual deviance: 1100.6  on 793  degrees of freedom
#> AIC: 1114.6
#> 
#> Number of Fisher Scoring iterations: 4

Prediction

Based oh the model described above “model_lgr” , We will try to make predictions using the test data that we already possess.

injury_test$prob_injury<-predict(model_lgr, type = "response", newdata = injury_test)
head(injury_test$prob_injury)
#> [1] 0.4877680 0.4746042 0.5486059 0.5769343 0.4824800 0.4965684

Now, We will try to observe the distribution of prediction probabilities for the data.

ggplot(injury_test, aes(x=prob_injury)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data") +
  theme_minimal()

In the graph above, it can be interpreted that the prediction results tend to cluster towards the middle, indicating that some players experience injury while others do not.

injury_test$pred_injury <- factor(ifelse(injury_test$prob_injury > 0.5, "1","0"))
injury_test[1:10, c("pred_injury", "Likelihood_of_Injury")]
#>    pred_injury Likelihood_of_Injury
#> 1            0                    0
#> 4            0                    1
#> 7            1                    0
#> 8            1                    1
#> 11           0                    0
#> 17           0                    1
#> 19           0                    0
#> 21           1                    0
#> 23           1                    0
#> 33           1                    0

In the syntax above, when the probability of the test data > 0.5, it means that the player “1 = not-injury” or “0 = Injury.”

Evaluation Model

To evaluate the model we have built, we will use a confusion matrix

library(caret)
log_conf <- confusionMatrix(injury_test$pred_injury, injury_test$Likelihood_of_Injury, positive = "1")
log_conf
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 54 45
#>          1 51 50
#>                                          
#>                Accuracy : 0.52           
#>                  95% CI : (0.4484, 0.591)
#>     No Information Rate : 0.525          
#>     P-Value [Acc > NIR] : 0.5845         
#>                                          
#>                   Kappa : 0.0405         
#>                                          
#>  Mcnemar's Test P-Value : 0.6098         
#>                                          
#>             Sensitivity : 0.5263         
#>             Specificity : 0.5143         
#>          Pos Pred Value : 0.4950         
#>          Neg Pred Value : 0.5455         
#>              Prevalence : 0.4750         
#>          Detection Rate : 0.2500         
#>    Detection Prevalence : 0.5050         
#>       Balanced Accuracy : 0.5203         
#>                                          
#>        'Positive' Class : 1              
#> 
  • Recall/Sensitivity : Out of all actual positive data, how the model is able correctly predict the proportion.
  • Specificity : Out of all actual negative data, how the model is able to correctly predict the proportion.
  • Accuracy : How the model is able to correctly predict the target Y.
  • Precision : Out of all the predictions made, how the model is able to correctly predict the positive class.
Recall <- round((50)/(50+45),2)
Specificity <- round((54)/(54+51),2)
Accuracy <- round((50+54)/(51+50+45+54),2)
Precision <- round((50)/(50+51),2)

performance <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance
#>   Accuracy Recall Precision Specificity
#> 1     0.52   0.53       0.5        0.51

Based on the above confusion matrix results, we got information that:

  • the model’s ability to predict the target Y (injury and not injury) is 52%.
  • out of all actual “not injury” data, the model is able to correctly predict 51%.
  • Out of all actual “injury” player data, the model is capable of correctly predicting 53%.
  • Out of all the predictions made by the model, it is able to correctly predict the positive class at 50%.

Tunning Model

In this case, we will focus on the “recall” metric due to the priority predictions is accurately predict actual injured players’ data.

For logistic regression model tuning stage, we will increase the recall value by adjusting the prediction threshold.

# meningkatkan recall
injury_test$pred_injury_new <- ifelse(injury_test$prob_injury > 0.45, "1","0")
injury_test[1:10, c("pred_injury_new", "Likelihood_of_Injury")]
#>    pred_injury_new Likelihood_of_Injury
#> 1                1                    0
#> 4                1                    1
#> 7                1                    0
#> 8                1                    1
#> 11               1                    0
#> 17               1                    1
#> 19               0                    0
#> 21               1                    0
#> 23               1                    0
#> 33               1                    0
library(caret)
log_conf2 <- confusionMatrix(data = as.factor(injury_test$pred_injury_new), injury_test$Likelihood_of_Injury, positive = "1")
log_conf2
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 26 12
#>          1 79 83
#>                                           
#>                Accuracy : 0.545           
#>                  95% CI : (0.4733, 0.6154)
#>     No Information Rate : 0.525           
#>     P-Value [Acc > NIR] : 0.3105          
#>                                           
#>                   Kappa : 0.1174          
#>                                           
#>  Mcnemar's Test P-Value : 4.559e-12       
#>                                           
#>             Sensitivity : 0.8737          
#>             Specificity : 0.2476          
#>          Pos Pred Value : 0.5123          
#>          Neg Pred Value : 0.6842          
#>              Prevalence : 0.4750          
#>          Detection Rate : 0.4150          
#>    Detection Prevalence : 0.8100          
#>       Balanced Accuracy : 0.5607          
#>                                           
#>        'Positive' Class : 1               
#> 
Recall <- round((83)/(83+12),2)
Specificity <- round((26)/(26+79),2)
Accuracy <- round((26+83)/(26+83+12+79),2)
Precision <- round((83)/(83+79),2)

performance <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance
#>   Accuracy Recall Precision Specificity
#> 1     0.54   0.87      0.51        0.25

Based on the tuning results above, by reduce the prediction threshold from 0.50 to 0.45, we got a better recall result of 0.87. It’s means the model is able to correctly predict 87% of the cases.

K-Nearest Neighbour

The k-NN (K-nearest neighbors) algorithm classifies new data by comparing the characteristics of the new data (test data) with the existing data (training data).

Pre-Processing Data

Checking range predictor variable

# cek range nilai tiap variable
summary(injury_clean)
#>    Player_Age    Player_Weight    Player_Height   Previous_Injuries
#>  Min.   :18.00   Min.   : 40.19   Min.   :145.3   Min.   :0.000    
#>  1st Qu.:22.00   1st Qu.: 67.94   1st Qu.:173.0   1st Qu.:0.000    
#>  Median :28.00   Median : 75.02   Median :180.0   Median :1.000    
#>  Mean   :28.23   Mean   : 74.79   Mean   :179.8   Mean   :0.515    
#>  3rd Qu.:34.00   3rd Qu.: 81.30   3rd Qu.:186.6   3rd Qu.:1.000    
#>  Max.   :39.00   Max.   :104.65   Max.   :207.3   Max.   :1.000    
#>  Training_Intensity  Recovery_Time   Likelihood_of_Injury
#>  Min.   :0.0000307   Min.   :1.000   0:500               
#>  1st Qu.:0.2410415   1st Qu.:2.000   1:500               
#>  Median :0.4839116   Median :4.000                       
#>  Mean   :0.4905380   Mean   :3.466                       
#>  3rd Qu.:0.7304045   3rd Qu.:5.000                       
#>  Max.   :0.9977494   Max.   :6.000

The range of each variable differs, hence it is necessary to perform feature scaling.

Cross Validation

set.seed(123)

# index sampling
index2 <- sample(x = nrow(injury_clean),
                size = nrow(injury_clean) * 0.8 ) #  0.8 (80%) for training data, 0.2 (20%) for testing data

# splitting
injury2_train <- injury_clean[index2, ] 
injury2_test <- injury_clean[-index2, ] 

Separate Predictor & Target

This step is performed to separate predictor data from target data.

library(dplyr)
# predictor
injury_train_x <- injury2_train %>% select_if(is.numeric)
injury_test_x <- injury2_test %>% select_if(is.numeric)

# target
injury_train_y <- injury2_train[,"Likelihood_of_Injury"]
injury_test_y <- injury2_test[,"Likelihood_of_Injury"]

Scalling Data

Scaling serves to standardize the range of predictor variables.

# train data scaling 
injury_train_xs <- scale(x = injury_train_x)
injury_train_xs %>% head(3)
#>     Player_Age Player_Weight Player_Height Previous_Injuries Training_Intensity
#> 415  -1.537882     0.3128078    -0.1679504         1.0068984         0.85272772
#> 463   1.177026     0.3612517     1.0335977        -0.9919074        -0.04329597
#> 179   1.327854    -1.5265176    -1.8343178         1.0068984         1.35058678
#>     Recovery_Time
#> 415    -1.4474063
#> 463     1.5019697
#> 179     0.9120945
# test data scaling data

injury_test_xs <- scale(x = injury_test_x,
                      center = attr(injury_train_xs, "scaled:center"), # nilai rata-rata 
                      scale = attr(injury_train_xs, "scaled:scale")) # nilai standar deviasi
injury_test_xs %>% head(3)
#>   Player_Age Player_Weight Player_Height Previous_Injuries Training_Intensity
#> 1 -0.6329129    -0.8411877    -0.4361923         1.0068984         -0.1123648
#> 3  0.5737128     0.5474231     0.6349032        -0.9919074          0.4406955
#> 7 -0.6329129    -0.4525372    -0.3035678        -0.9919074          1.1835209
#>   Recovery_Time
#> 1     0.9120945
#> 3    -0.8575311
#> 7    -0.8575311

Predict

# find optimum k
sqrt(nrow(injury_train_x))
#> [1] 28.28427
library(class) # package `knn()`

injury_pred_knn <- knn(train = injury_train_xs, 
                 test = injury_test_xs, 
                 cl = injury_train_y,
                 k = 28) 

head(injury_pred_knn, 10)
#>  [1] 1 1 1 1 0 1 1 0 1 1
#> Levels: 0 1

Model evaluation

# confusion matrix
library(caret)

confusionMatrix(data = injury_pred_knn,
                reference = injury_test_y,
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 44 39
#>          1 58 59
#>                                           
#>                Accuracy : 0.515           
#>                  95% CI : (0.4435, 0.5861)
#>     No Information Rate : 0.51            
#>     P-Value [Acc > NIR] : 0.47201         
#>                                           
#>                   Kappa : 0.0333          
#>                                           
#>  Mcnemar's Test P-Value : 0.06761         
#>                                           
#>             Sensitivity : 0.6020          
#>             Specificity : 0.4314          
#>          Pos Pred Value : 0.5043          
#>          Neg Pred Value : 0.5301          
#>              Prevalence : 0.4900          
#>          Detection Rate : 0.2950          
#>    Detection Prevalence : 0.5850          
#>       Balanced Accuracy : 0.5167          
#>                                           
#>        'Positive' Class : 1               
#> 

A priority metric of “Recall” at 0.60 means that the model can correctly predict players who have suffered an “injury” at a rate of 60%.

Conclusion

• In predicting players who experience injury and those who do not using 6 predictor variables: Player_Age, Player_Weight, Player_Height, Previous_Injuries, Training_Intensity, and Recovery_Time.

• In the model evaluation stage for both models above, we use the confusion matrix function with the priority metric being Sensitivity/recall. The focus of the prediction model is on the model that, out of all actual positive data (injuries), can predict correctly.

• The risk if the model fails to predict accurately: ** FN: Predicted as not injured, but actually injured -> delayed treatment ** FP: Predicted as injured, but actually not injured -> incorrect treatment Conclusion: FN is more dangerous, priority metric: Recall

• The model using the Logistic Regression algorithm obtained a Sensitivity value of 53%, meaning it correctly predicts the proportion of actual positive data. After tuning the model by reduce the prediction threshold by 0.05, a Sensitivity value of 87% was obtained.

• In the model using the K-Nearest Neighbor algorithm, a Sensitivity value of 60% was obtained.

• Both models have a difference in Sensitivity value of 7%, with the K-Nearest Neighbor algorithm having a larger value than the Logistic Regression algorithm. It can be concluded that both models need a tuning process to achieve better results.

• In this case, the suitable model is Logistic Regression with a Sensitivity value of 87%.