Injury Prediction
Introduction
To concern player safety and injury prevention in competitive sports, the dataset designed specifically for injury prediction. The Dataset representation of player health and injury data. Our synthetic dataset captures critical attributes such as player demographics, training intensities, recovery times, and previous injury histories. We establish correlations between these features and the likelihood of future injuries to accurately simulate real-world scenarios.
In this project of Programming for Data Science with R, we would like to make a predict model which performing predictions for players safety and injury prevention in competitive sports based on the categories of several supporting variables. The algorithms I will use are logistic regression and k-nearest neighbors. This data collected from kaggle.
Data Preparation
Import Library
Load Dataset
#> Player_Age Player_Weight Player_Height Previous_Injuries Training_Intensity
#> 1 24 66.25193 175.7324 1 0.4579290
#> 2 37 70.99627 174.5817 0 0.2265216
#> 3 32 80.09378 186.3296 0 0.6139703
#> 4 28 87.47327 175.5042 1 0.2528581
#> 5 25 84.65922 190.1750 0 0.5776318
#> 6 38 75.82055 206.6318 1 0.3592087
#> Recovery_Time Likelihood_of_Injury
#> 1 5 0
#> 2 6 1
#> 3 2 1
#> 4 4 1
#> 5 1 1
#> 6 4 0
Column Description
Predictor Variabels
- Player_Age : Age of the player in years.
- Player_Weight : Weight of the player in kilograms, following a normal distribution with mean 75 and standard deviation 10.
- Player_Height : Height of the player in centimeters, following a normal distribution with mean 180 and standard deviation 10.
- Previous_Injuries : Binary indicator (0 or 1) representing whether the player has had previous injuries (1) or not (0).
- Training_Intensity : A value between 0 and 1 representing the intensity of the player’s training regimen.
- Recovery_Time : Number of days required for the player to recover from an injury, ranging from 1 to 6 days.
Target Variabel
- Likelihood_of_Injury : Binary indicator (0 or 1) representing the likelihood of the player experiencing an injury (1) or not (0).
Data Processing
Check General Data Information
#> Rows: 1,000
#> Columns: 7
#> $ Player_Age <int> 24, 37, 32, 28, 25, 38, 24, 36, 28, 28, 38, 21, 2…
#> $ Player_Weight <dbl> 66.25193, 70.99627, 80.09378, 87.47327, 84.65922,…
#> $ Player_Height <dbl> 175.7324, 174.5817, 186.3296, 175.5042, 190.1750,…
#> $ Previous_Injuries <int> 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0…
#> $ Training_Intensity <dbl> 0.45792899, 0.22652163, 0.61397031, 0.25285812, 0…
#> $ Recovery_Time <int> 5, 6, 2, 4, 1, 4, 2, 3, 1, 1, 3, 4, 5, 4, 2, 6, 5…
#> $ Likelihood_of_Injury <int> 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1…
From the dataset above, the data has 7 columns, 1000 rows and the data types for each column. Checking the data types is a crucial step due to the data types must be appropriate for analysis.
Missing Value
#> Player_Age Player_Weight Player_Height
#> 0 0 0
#> Previous_Injuries Training_Intensity Recovery_Time
#> 0 0 0
#> Likelihood_of_Injury
#> 0
In the dataset above, has no missing value data in any columns.
Duplicates
#> [1] 0
In the dataset above, has no duplicates data in any columns.
Checking Target Variabels Class
#>
#> 0 1
#> 0.5 0.5
#>
#> 0 1
#> 500 500
Checking the proportion of the target variable class “Likelihood_of_Injury,” it was found that the class proportions are 50% for “0” (not-injury) and 50% for class “1” (injury).
Modelling
Logistic Regression
Logistic regression is a statistical model used for binary classification tasks, where the outcome variable (dependent variable) is categorical and has two possible outcomes, typically labeled as 0 and 1. It is an extension of linear regression, but instead of predicting the value of a continuous variable, logistic regression predicts the probability that an observation belongs to a certain category.
Train_Test Split Data
Before we make the predict model, we should to split the data into train dataset and test dataset. We will use the train dataset to train the linear regression model. The test dataset will be used as a comparasion and see if the model get overfit and can not predict new data that hasn’t been seen during training phase. We will 80% of the data as the training data and the rest of it as the testing data.
set.seed(221)
index <- sample(nrow(injury_clean), nrow(injury_clean)*0.8)
injury_train <- injury_clean[index, ]
injury_test <- injury_clean[-index, ]
injury_clean$Likelihood_of_Injury %>%
levels()#> [1] "0" "1"
model_lgr <- glm(formula = Likelihood_of_Injury ~., family = "binomial",
data = injury_train)
summary(model_lgr)#>
#> Call:
#> glm(formula = Likelihood_of_Injury ~ ., family = "binomial",
#> data = injury_train)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.375709 1.427224 -0.964 0.3351
#> Player_Age 0.001195 0.010862 0.110 0.9124
#> Player_Weight 0.000704 0.007292 0.097 0.9231
#> Player_Height 0.006295 0.007324 0.860 0.3900
#> Previous_Injuries 0.114073 0.142648 0.800 0.4239
#> Training_Intensity 0.571054 0.249108 2.292 0.0219 *
#> Recovery_Time -0.046091 0.042203 -1.092 0.2748
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 1108.9 on 799 degrees of freedom
#> Residual deviance: 1100.6 on 793 degrees of freedom
#> AIC: 1114.6
#>
#> Number of Fisher Scoring iterations: 4
Prediction
Based oh the model described above “model_lgr” , We will try to make predictions using the test data that we already possess.
injury_test$prob_injury<-predict(model_lgr, type = "response", newdata = injury_test)
head(injury_test$prob_injury)#> [1] 0.4877680 0.4746042 0.5486059 0.5769343 0.4824800 0.4965684
Now, We will try to observe the distribution of prediction probabilities for the data.
ggplot(injury_test, aes(x=prob_injury)) +
geom_density(lwd=0.5) +
labs(title = "Distribution of Probability Prediction Data") +
theme_minimal()In the graph above, it can be interpreted that the prediction results tend to cluster towards the middle, indicating that some players experience injury while others do not.
injury_test$pred_injury <- factor(ifelse(injury_test$prob_injury > 0.5, "1","0"))
injury_test[1:10, c("pred_injury", "Likelihood_of_Injury")]#> pred_injury Likelihood_of_Injury
#> 1 0 0
#> 4 0 1
#> 7 1 0
#> 8 1 1
#> 11 0 0
#> 17 0 1
#> 19 0 0
#> 21 1 0
#> 23 1 0
#> 33 1 0
In the syntax above, when the probability of the test data > 0.5, it means that the player “1 = not-injury” or “0 = Injury.”
Evaluation Model
To evaluate the model we have built, we will use a confusion matrix
library(caret)
log_conf <- confusionMatrix(injury_test$pred_injury, injury_test$Likelihood_of_Injury, positive = "1")
log_conf#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 54 45
#> 1 51 50
#>
#> Accuracy : 0.52
#> 95% CI : (0.4484, 0.591)
#> No Information Rate : 0.525
#> P-Value [Acc > NIR] : 0.5845
#>
#> Kappa : 0.0405
#>
#> Mcnemar's Test P-Value : 0.6098
#>
#> Sensitivity : 0.5263
#> Specificity : 0.5143
#> Pos Pred Value : 0.4950
#> Neg Pred Value : 0.5455
#> Prevalence : 0.4750
#> Detection Rate : 0.2500
#> Detection Prevalence : 0.5050
#> Balanced Accuracy : 0.5203
#>
#> 'Positive' Class : 1
#>
- Recall/Sensitivity : Out of all actual positive data, how the model is able correctly predict the proportion.
- Specificity : Out of all actual negative data, how the model is able to correctly predict the proportion.
- Accuracy : How the model is able to correctly predict the target Y.
- Precision : Out of all the predictions made, how the model is able to correctly predict the positive class.
Recall <- round((50)/(50+45),2)
Specificity <- round((54)/(54+51),2)
Accuracy <- round((50+54)/(51+50+45+54),2)
Precision <- round((50)/(50+51),2)
performance <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance#> Accuracy Recall Precision Specificity
#> 1 0.52 0.53 0.5 0.51
Based on the above confusion matrix results, we got information that:
- the model’s ability to predict the target Y (injury and not injury) is 52%.
- out of all actual “not injury” data, the model is able to correctly predict 51%.
- Out of all actual “injury” player data, the model is capable of correctly predicting 53%.
- Out of all the predictions made by the model, it is able to correctly predict the positive class at 50%.
Tunning Model
In this case, we will focus on the “recall” metric due to the priority predictions is accurately predict actual injured players’ data.
For logistic regression model tuning stage, we will increase the recall value by adjusting the prediction threshold.
# meningkatkan recall
injury_test$pred_injury_new <- ifelse(injury_test$prob_injury > 0.45, "1","0")
injury_test[1:10, c("pred_injury_new", "Likelihood_of_Injury")]#> pred_injury_new Likelihood_of_Injury
#> 1 1 0
#> 4 1 1
#> 7 1 0
#> 8 1 1
#> 11 1 0
#> 17 1 1
#> 19 0 0
#> 21 1 0
#> 23 1 0
#> 33 1 0
library(caret)
log_conf2 <- confusionMatrix(data = as.factor(injury_test$pred_injury_new), injury_test$Likelihood_of_Injury, positive = "1")
log_conf2#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 26 12
#> 1 79 83
#>
#> Accuracy : 0.545
#> 95% CI : (0.4733, 0.6154)
#> No Information Rate : 0.525
#> P-Value [Acc > NIR] : 0.3105
#>
#> Kappa : 0.1174
#>
#> Mcnemar's Test P-Value : 4.559e-12
#>
#> Sensitivity : 0.8737
#> Specificity : 0.2476
#> Pos Pred Value : 0.5123
#> Neg Pred Value : 0.6842
#> Prevalence : 0.4750
#> Detection Rate : 0.4150
#> Detection Prevalence : 0.8100
#> Balanced Accuracy : 0.5607
#>
#> 'Positive' Class : 1
#>
Recall <- round((83)/(83+12),2)
Specificity <- round((26)/(26+79),2)
Accuracy <- round((26+83)/(26+83+12+79),2)
Precision <- round((83)/(83+79),2)
performance <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance#> Accuracy Recall Precision Specificity
#> 1 0.54 0.87 0.51 0.25
Based on the tuning results above, by reduce the prediction threshold from 0.50 to 0.45, we got a better recall result of 0.87. It’s means the model is able to correctly predict 87% of the cases.
K-Nearest Neighbour
The k-NN (K-nearest neighbors) algorithm classifies new data by comparing the characteristics of the new data (test data) with the existing data (training data).
Pre-Processing Data
Checking range predictor variable
#> Player_Age Player_Weight Player_Height Previous_Injuries
#> Min. :18.00 Min. : 40.19 Min. :145.3 Min. :0.000
#> 1st Qu.:22.00 1st Qu.: 67.94 1st Qu.:173.0 1st Qu.:0.000
#> Median :28.00 Median : 75.02 Median :180.0 Median :1.000
#> Mean :28.23 Mean : 74.79 Mean :179.8 Mean :0.515
#> 3rd Qu.:34.00 3rd Qu.: 81.30 3rd Qu.:186.6 3rd Qu.:1.000
#> Max. :39.00 Max. :104.65 Max. :207.3 Max. :1.000
#> Training_Intensity Recovery_Time Likelihood_of_Injury
#> Min. :0.0000307 Min. :1.000 0:500
#> 1st Qu.:0.2410415 1st Qu.:2.000 1:500
#> Median :0.4839116 Median :4.000
#> Mean :0.4905380 Mean :3.466
#> 3rd Qu.:0.7304045 3rd Qu.:5.000
#> Max. :0.9977494 Max. :6.000
The range of each variable differs, hence it is necessary to perform feature scaling.
Cross Validation
Separate Predictor & Target
This step is performed to separate predictor data from target data.
Scalling Data
Scaling serves to standardize the range of predictor variables.
#> Player_Age Player_Weight Player_Height Previous_Injuries Training_Intensity
#> 415 -1.537882 0.3128078 -0.1679504 1.0068984 0.85272772
#> 463 1.177026 0.3612517 1.0335977 -0.9919074 -0.04329597
#> 179 1.327854 -1.5265176 -1.8343178 1.0068984 1.35058678
#> Recovery_Time
#> 415 -1.4474063
#> 463 1.5019697
#> 179 0.9120945
# test data scaling data
injury_test_xs <- scale(x = injury_test_x,
center = attr(injury_train_xs, "scaled:center"), # nilai rata-rata
scale = attr(injury_train_xs, "scaled:scale")) # nilai standar deviasi
injury_test_xs %>% head(3)#> Player_Age Player_Weight Player_Height Previous_Injuries Training_Intensity
#> 1 -0.6329129 -0.8411877 -0.4361923 1.0068984 -0.1123648
#> 3 0.5737128 0.5474231 0.6349032 -0.9919074 0.4406955
#> 7 -0.6329129 -0.4525372 -0.3035678 -0.9919074 1.1835209
#> Recovery_Time
#> 1 0.9120945
#> 3 -0.8575311
#> 7 -0.8575311
Predict
#> [1] 28.28427
library(class) # package `knn()`
injury_pred_knn <- knn(train = injury_train_xs,
test = injury_test_xs,
cl = injury_train_y,
k = 28)
head(injury_pred_knn, 10)#> [1] 1 1 1 1 0 1 1 0 1 1
#> Levels: 0 1
Model evaluation
# confusion matrix
library(caret)
confusionMatrix(data = injury_pred_knn,
reference = injury_test_y,
positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 44 39
#> 1 58 59
#>
#> Accuracy : 0.515
#> 95% CI : (0.4435, 0.5861)
#> No Information Rate : 0.51
#> P-Value [Acc > NIR] : 0.47201
#>
#> Kappa : 0.0333
#>
#> Mcnemar's Test P-Value : 0.06761
#>
#> Sensitivity : 0.6020
#> Specificity : 0.4314
#> Pos Pred Value : 0.5043
#> Neg Pred Value : 0.5301
#> Prevalence : 0.4900
#> Detection Rate : 0.2950
#> Detection Prevalence : 0.5850
#> Balanced Accuracy : 0.5167
#>
#> 'Positive' Class : 1
#>
A priority metric of “Recall” at 0.60 means that the model can correctly predict players who have suffered an “injury” at a rate of 60%.
Conclusion
• In predicting players who experience injury and those who do not using 6 predictor variables: Player_Age, Player_Weight, Player_Height, Previous_Injuries, Training_Intensity, and Recovery_Time.
• In the model evaluation stage for both models above, we use the confusion matrix function with the priority metric being Sensitivity/recall. The focus of the prediction model is on the model that, out of all actual positive data (injuries), can predict correctly.
• The risk if the model fails to predict accurately: ** FN: Predicted as not injured, but actually injured -> delayed treatment ** FP: Predicted as injured, but actually not injured -> incorrect treatment Conclusion: FN is more dangerous, priority metric: Recall
• The model using the Logistic Regression algorithm obtained a Sensitivity value of 53%, meaning it correctly predicts the proportion of actual positive data. After tuning the model by reduce the prediction threshold by 0.05, a Sensitivity value of 87% was obtained.
• In the model using the K-Nearest Neighbor algorithm, a Sensitivity value of 60% was obtained.
• Both models have a difference in Sensitivity value of 7%, with the K-Nearest Neighbor algorithm having a larger value than the Logistic Regression algorithm. It can be concluded that both models need a tuning process to achieve better results.
• In this case, the suitable model is Logistic Regression with a Sensitivity value of 87%.