Seoul 2014 - League of Legends World Championship Finals - Samsung White VS Star Horn Royal Club
In this report, I describe how I selected the best classification model from three fitted models that predict the winning probability for a team, based on data from different professional regional League of Legends leagues from around the world in 2019.
You can check the final model in a Shiny App HERE
Github Repository HERE
League of Legends (abbreviated LoL) is a multiplayer online battle arena video game developed and published by Riot Games for Microsoft Windows and macOS. The goal in the game is usually to destroy the opposing team’s “Nexus”, a structure that lies at the heart of a base protected by defensive structures.
League of Legends has an active and widespread competitive scene. In North America and Europe, Riot Games organizes the League Championship Series (LCS), located in Los Angeles and the League of Legends European Championship (LEC), located in Berlin respectively. Similar regional competitions exist in China (LPL), South Korea (LCK), Taiwan/Hong Kong/Macau (LMS), and various other regions. These regional competitions culminate with the annual World Championship.
The data was extracted from Oracle’s Elixir League of Legends Esports Statistics. It comprises game results from the CBLoL, LCK, LCS, LEC, LMS and MSI in 2019.
The data has 2734 observations on 5 factor variables and 2 numeric variables.
Data Authors:
Tim “Magic” Sevenhuysen of OraclesElixir.com.
Special thanks to the above mentioned for making their data freely available to analysts, commentators, and fans.
Important:
Some of the variables were not in the original dataset. To know more about how I cleaned and arranged the raw data, and how I built some of the variables in the list, check out my R code HERE.
Let’s take a look at the first rows of the tidy data to see what it looks like.
# Setting Working Directory
setwd("~/Diego/Analytics/LoL Analytics/Logistic Model")
# Loading Libraries
library(dplyr)
library(readxl)
# Loading the .xlsx dataset into RStudio.
ModelData <- read_excel(path = "~/Diego/Analytics/LoL Analytics/Logistic Model/Extra Files/ModelData.xlsx",
col_types = c(rep("guess", 5), rep("numeric", 2)), sheet = 1, col_names = TRUE)
# Giving Format To Factor Variables
ModelData$result <- as.factor(ModelData$result)
ModelData$side <- as.factor(ModelData$side)
ModelData$elementalsd <- as.factor(ModelData$elementalsd)
ModelData$elderd <- as.factor(ModelData$elderd)
ModelData$barond <- as.factor(ModelData$barond)
result | side | elementalsd | elderd | barond | wardratio | gspd |
---|---|---|---|---|---|---|
Defeat | Blue | [-3,3] | 0 | -1 | 2.28 | -0.027 |
Victory | Red | [-3,3] | 0 | 1 | 2.47 | 0.027 |
Victory | Blue | [-6,-4] | 0 | [2,4] | 2.15 | 0.046 |
Defeat | Red | [4,6] | 0 | [-4,-2] | 2.05 | -0.046 |
Defeat | Blue | [-3,3] | 0 | -1 | 2.62 | 0.034 |
Victory | Red | [-3,3] | 0 | 1 | 2.29 | -0.034 |
We can start by looking at how the observations are distributed by result (game result) versus each variable, to make sure that every level is represented by a reasonable number of games. For this reason, barond (Baron Nashor difference) and elementalsd (Elemental Dragon difference) have grouped levels to gain consistency and statistical significance.
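As a quick check, a cross-tabulation of the game result against each grouped factor confirms that every level keeps a workable number of games; a simple sketch with base R:
# Counting games by result for each grouped factor
table(ModelData$result, ModelData$barond)
table(ModelData$result, ModelData$elementalsd)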
Variables: when gspd (gold spent percent difference) is positive, the team has achieved more objectives, scored more kills/assists during the game, or both.
We prepare the data for prediction by splitting ModelData into 75% as ModelTrain and 25% as ModelTest. This split will serve to test the models' accuracy.
#Partitioning
library(caret)
set.seed(12345)
split <- createDataPartition(ModelData$result, p = 0.75, list = FALSE)
ModelTrain <- ModelData[split, ]
ModelTest <- ModelData[-split, ]
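Since createDataPartition() samples within each level of the outcome, both subsets should preserve the original Victory/Defeat balance; a quick sanity check:
# Class balance in the training and test subsets
prop.table(table(ModelTrain$result))
prop.table(table(ModelTest$result))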
To predict the outcome, I will use three different methods fitted on the ModelTrain dataset:
- Logistic Regression
- Random Forests
- Decision Trees
Then they will be applied to the ModelTest dataset to compare their accuracies. The best model will be used to predict the winning team.
For the side of the map (side), we set the “Blue” side as the reference level for this regressor, so its effect is absorbed into the intercept. For the differences in Elemental Dragons (elementalsd), Elder Dragons (elderd) and Baron Nashors (barond) secured against the opposing team, we set the neutral levels ("[-3,3]" for Elemental Dragons, "0" for the other two) as the reference.
# Re-Leveling Categorical Variables to set the models' reference (intercept) levels
ModelTrain$side <- relevel(ModelTrain$side, ref = "Blue")
ModelTrain$elementalsd <- relevel(ModelTrain$elementalsd, ref = "[-3,3]")
ModelTrain$elderd <- relevel(ModelTrain$elderd, ref = "0")
ModelTrain$barond <- relevel(ModelTrain$barond, ref = "0")
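The first level printed by levels() is the baseline that glm() absorbs into the intercept, so we can quickly confirm that the re-leveling worked as intended:
# The first level listed is the reference level for each factor
levels(ModelTrain$side)
levels(ModelTrain$elementalsd)
levels(ModelTrain$barond)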
Let’s start by trying to predict a victory in League of Legends using all the variables available. We store the output in a variable called logistic.
# Base Model Fit
logistic <- glm(result ~ side + elementalsd + elderd + barond + gspd + wardratio,
data = ModelTrain, family ="binomial")
The exponentiated coefficients can be interpreted as odds ratios:
- The odds of winning a game playing on the red side of the map (sideRed = 1) over the odds of winning on the blue side (sideRed = 0) are exp(-0.5088) = 0.6. In terms of percent change, the odds for the red side are 40% lower than the odds for the blue side.
- The odds of winning a game with a 4-to-6 difference in Elemental Dragons (elementalsd[4,6] = 1) over the odds of winning with a difference in the reference range [-3,3] are exp(1.2784) = 3.59. In terms of percent change, the odds for a 4-to-6 difference in Elemental Dragons are 259% higher than the odds for the reference range.
- The odds of winning a game with a -2 difference in Elder Dragons (elderd-2 = 1) over the odds of winning with a 0 difference are exp(-2.4659) = 0.08. In terms of percent change, the odds for a -2 difference in Elder Dragons are 92% lower than the odds for a 0 difference.
- The odds of winning a game with a +1 difference in Baron Nashors (barond1 = 1) over the odds of winning with a 0 difference are exp(1.3898) = 4.01. In terms of percent change, the odds for a +1 difference in Baron Nashors are 301% higher than the odds for a 0 difference.
- Each one-unit increase in ward ratio (wardratio) multiplies the odds of winning by exp(0.6136) = 1.85.
- Each one-unit increase in gold spent percent difference (gspd) multiplies the odds of winning by exp(18.1315) = 74887684. However, a one-unit increase is not an accurate reference here, since approximately 70% of the observations in the dataset fall between -0.14 and 0.14.
summary(logistic)
##
## Call:
## glm(formula = result ~ side + elementalsd + elderd + barond +
## gspd + wardratio, family = "binomial", data = ModelTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6029 -0.2390 -0.0007 0.2125 3.6448
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2099 0.5992 -2.019 0.04349 *
## sideRed -0.5088 0.1859 -2.738 0.00619 **
## elementalsd[-6,-4] -1.1366 0.4193 -2.711 0.00671 **
## elementalsd[4,6] 1.2784 0.4152 3.079 0.00208 **
## elderd-1 -0.6950 0.3078 -2.258 0.02395 *
## elderd-2 -2.4659 1.1220 -2.198 0.02797 *
## elderd1 0.6177 0.3063 2.016 0.04375 *
## elderd2 4.0823 1.4726 2.772 0.00557 **
## barond-1 -1.3529 0.2474 -5.468 4.56e-08 ***
## barond[-4,-2] -1.7851 0.3321 -5.375 7.64e-08 ***
## barond[2,4] 1.8999 0.3518 5.401 6.64e-08 ***
## barond1 1.3898 0.2508 5.541 3.01e-08 ***
## gspd 18.1315 1.4250 12.724 < 2e-16 ***
## wardratio 0.6136 0.2309 2.658 0.00787 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2844.68 on 2051 degrees of freedom
## Residual deviance: 826.78 on 2038 degrees of freedom
## AIC: 854.78
##
## Number of Fisher Scoring iterations: 7
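To read the estimates in the summary above directly on the odds-ratio scale, as in the interpretation given earlier, the coefficients can be exponentiated; a minimal sketch:
# Coefficients converted to odds ratios (values above 1 increase the odds of winning)
round(exp(coef(logistic)), 3)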
Stepwise backward elimination with the Akaike information criterion (AIC) does not provide proof of a model in the sense of testing a null hypothesis; that is, AIC cannot say anything about the quality of a model in an absolute sense. If all the candidate models fit poorly, AIC will give no indication of it. What this criterion does is penalize models with many parameters in favor of the most parsimonious models.
library(MASS)
AIC.step <- stepAIC(logistic, scope = list(upper = logistic$formula, lower = ~1), direction = "backward")
## Start: AIC=854.78
## result ~ side + elementalsd + elderd + barond + gspd + wardratio
##
## Df Deviance AIC
## <none> 826.78 854.78
## - wardratio 1 833.98 859.98
## - side 1 834.35 860.35
## - elementalsd 2 847.11 871.11
## - elderd 4 857.57 877.57
## - barond 4 1038.22 1058.22
## - gspd 1 1080.49 1106.49
The backward elimination procedure suggests, in its first step, not removing any variable from the model. Still, wardratio is the variable whose removal would increase the AIC the least. Since this variable did not show a strong trend in the exploratory data analysis, we fit a new model with the remaining regressors to evaluate its significance. We store the output in a variable called logistic2.
# Reduced Model Fit
logistic2 <- glm(result ~ side + elementalsd + elderd + barond + gspd, data = ModelTrain, family ="binomial")
summary(logistic2)
##
## Call:
## glm(formula = result ~ side + elementalsd + elderd + barond +
## gspd, family = "binomial", data = ModelTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6498 -0.2321 -0.0004 0.2174 3.6908
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2955 0.1994 1.482 0.13842
## sideRed -0.5458 0.1844 -2.960 0.00308 **
## elementalsd[-6,-4] -1.2120 0.4158 -2.915 0.00356 **
## elementalsd[4,6] 1.2686 0.4122 3.078 0.00209 **
## elderd-1 -0.7888 0.3054 -2.583 0.00981 **
## elderd-2 -2.7014 1.1025 -2.450 0.01428 *
## elderd1 0.5379 0.3034 1.773 0.07628 .
## elderd2 4.0470 1.5150 2.671 0.00756 **
## barond-1 -1.3807 0.2452 -5.631 1.79e-08 ***
## barond[-4,-2] -1.7777 0.3314 -5.364 8.12e-08 ***
## barond[2,4] 1.9186 0.3509 5.468 4.55e-08 ***
## barond1 1.3794 0.2494 5.532 3.17e-08 ***
## gspd 18.9022 1.4073 13.432 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2844.68 on 2051 degrees of freedom
## Residual deviance: 833.98 on 2039 degrees of freedom
## AIC: 859.98
##
## Number of Fisher Scoring iterations: 7
We are comparing the likelihood of the data under the full model against the likelihood of the data under the model with fewer predictors. It is necessary to test whether the observed difference between the two models is statistically significant.
Given that H0 holds that the reduced model is true, a p-value for the overall model-fit statistic below 0.05 would compel us to reject the null hypothesis. It would provide evidence against the reduced model (logistic2) in favor of the full model (logistic).
# Testing with Anova
anova(logistic, logistic2, test ="Chisq")
## Analysis of Deviance Table
##
## Model 1: result ~ side + elementalsd + elderd + barond + gspd + wardratio
## Model 2: result ~ side + elementalsd + elderd + barond + gspd
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2038 826.78
## 2 2039 833.98 -1 -7.1989 0.007295 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case, both the lrtest and the anova test favor the full model (logistic), so we will continue the analysis with this model.
# Testing with lrtest function
library(lmtest)
lrtest(logistic, logistic2)
## Likelihood ratio test
##
## Model 1: result ~ side + elementalsd + elderd + barond + gspd + wardratio
## Model 2: result ~ side + elementalsd + elderd + barond + gspd
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 14 -413.39
## 2 13 -416.99 -1 7.1989 0.007295 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is the overall predictive power of the model. Unlike linear regression with ordinary least squares estimation, there is no R2 statistic that explains the proportion of variance in the dependent variable accounted for by the predictors. However, there are a number of pseudo R2 metrics that can be of value; the most notable is McFadden’s R2.
In this case, McFadden’s R2 is 0.709, which indicates a very good fit.
library(pscl)
pR2(logistic)[4]
## McFadden
## 0.7093593
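Because, for a binary-outcome glm, the deviance equals minus twice the log-likelihood, McFadden’s R2 can also be recovered by hand from the deviances reported in the model summary; a quick sketch:
# McFadden's R2 = 1 - logLik(model)/logLik(null), computed from the deviances
1 - logistic$deviance / logistic$null.deviance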
P-value: we can also obtain an overall p-value for the model by comparing the drop in deviance (null deviance minus residual deviance) to a Chi-square distribution with the corresponding degrees of freedom. Because the resulting p-value is essentially zero, the model as a whole is statistically significant.
with(logistic, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = F))
## [1] 0
This test is commonly used to determine the significance of odds ratios in logistic regression. The Wald test takes advantage of the fact that log odds ratios (just like log odds) are approximately normally distributed. The idea is to test the hypothesis that the coefficient of an independent variable in the model is significantly different from zero.
If the test fails to reject the null hypothesis, this suggests that removing the variable from the model will not substantially harm the fit of that model.
library(survey)
regTermTest(logistic, "side")
## Wald test for side
## in glm(formula = result ~ side + elementalsd + elderd + barond +
## gspd + wardratio, family = "binomial", data = ModelTrain)
## F = 7.495472 on 1 and 2038 df: p= 0.0062392
regTermTest(logistic, "elementalsd")
## Wald test for elementalsd
## in glm(formula = result ~ side + elementalsd + elderd + barond +
## gspd + wardratio, family = "binomial", data = ModelTrain)
## F = 8.804368 on 2 and 2038 df: p= 0.00015586
regTermTest(logistic, "elderd")
## Wald test for elderd
## in glm(formula = result ~ side + elementalsd + elderd + barond +
## gspd + wardratio, family = "binomial", data = ModelTrain)
## F = 5.778124 on 4 and 2038 df: p= 0.00012699
regTermTest(logistic, "barond")
## Wald test for barond
## in glm(formula = result ~ side + elementalsd + elderd + barond +
## gspd + wardratio, family = "binomial", data = ModelTrain)
## F = 44.73319 on 4 and 2038 df: p= < 2.22e-16
regTermTest(logistic, "wardratio")
## Wald test for wardratio
## in glm(formula = result ~ side + elementalsd + elderd + barond +
## gspd + wardratio, family = "binomial", data = ModelTrain)
## F = 7.063494 on 1 and 2038 df: p= 0.0079285
regTermTest(logistic, "gspd")
## Wald test for gspd
## in glm(formula = result ~ side + elementalsd + elderd + barond +
## gspd + wardratio, family = "binomial", data = ModelTrain)
## F = 161.8944 on 1 and 2038 df: p= < 2.22e-16
To assess the relative importance of individual predictors in the model, we can also look at the absolute value of the test statistic (the z value) for each model parameter.
varImp(logistic)
 | Overall |
---|---|
sideRed | 2.737786 |
elementalsd[-6,-4] | 2.710659 |
elementalsd[4,6] | 3.078925 |
elderd-1 | 2.257989 |
elderd-2 | 2.197726 |
elderd1 | 2.016484 |
elderd2 | 2.772249 |
barond-1 | 5.467608 |
barond[-4,-2] | 5.375319 |
barond[2,4] | 5.400520 |
barond1 | 5.540678 |
gspd | 12.723774 |
wardratio | 2.657723 |
The Receiver Operating Characteristic (ROC) curve traces the percentage of true positives accurately predicted by a given logit model as the prediction probability threshold is lowered from 1 to 0. For a good model, as the threshold is lowered, it should mark more of the actual 1’s as positives and fewer of the actual 0’s as 1’s. The curve should therefore rise steeply, indicating that the TPR (y-axis) increases faster than the FPR (x-axis) as the threshold decreases.
# Store the predicted values for training dataset in "Pred_Train" variable.
Pred_Train <- predict(logistic, ModelTrain, type="response")
# Load ROCR library
library(ROCR)
# Define the ROCRPred and ROCRPerf variables
ROCRPred <- prediction(Pred_Train, ModelTrain$result)
ROCRPerf <- performance(ROCRPred, "tpr", "fpr")
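The curve itself can then be drawn from the ROCRPerf object; a minimal sketch, with the diagonal added as the no-skill reference:
# Plot the ROC curve with a reference diagonal
plot(ROCRPerf, colorize = TRUE)
abline(a = 0, b = 1, lty = 2)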
The greater the area under the ROC curve, the better predictive ability of the model.
# Area under the curve
library(pROC)
ROC1 <- roc(as.factor(ifelse(ModelTrain$result == "Victory", 1, 0)), Pred_Train)
auc(ROC1)
## Area under the curve: 0.9747
Here we compare the selected threshold of 0.5 against values of 0.3 and 0.7, to see how the model responds on the test dataset. The best threshold turns out to be 0.5, because it reports the highest accuracy.
Since we are predicting a video game result, and not a delicate matter like the presence of a disease, there is no need to lower the threshold to capture more true positives at the expense of more false positives; doing so would also cost the model some accuracy.
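As a side check, pROC can also report the threshold that maximizes the sum of sensitivity and specificity on the training ROC (the ROC1 object built above); a small sketch:
# Threshold that maximizes Youden's J on the training ROC
coords(ROC1, "best", ret = c("threshold", "sensitivity", "specificity"))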
# Logistic Model Predictions on test Dataset
Pred_Test <- predict(logistic, ModelTest, type="response")
# Testing different thresholds for the Logistic Model
for (k in seq(0.3,0.7, by = 0.2)) {
Model.test.observed <- as.factor(ifelse(ModelTest$result == "Victory", 1, 0))
Model.test.predt <- function(k) ifelse(Pred_Test > k , 1,0)
CM_Test <- confusionMatrix(as.factor(Model.test.predt(k)), Model.test.observed)$overall[1]
Temp1 <- paste("CM_Test", k, sep = "_")
assign(Temp1, CM_Test)
}
 | Threshold_0.3 | Threshold_0.5 | Threshold_0.7 |
---|---|---|---|
Accuracy | 0.9090909 | 0.9222874 | 0.9120235 |
Below we can see the confusion matrix for the model with the definitive 0.5 threshold, showing the comparison between the predicted target variable and the observed value for each observation.
# Confusion matrix at the definitive 0.5 threshold
CM_Test <- confusionMatrix(as.factor(Model.test.predt(0.5)), Model.test.observed)
AC_Test <- CM_Test$overall[1]
CM_Test
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 313 25
## 1 28 316
##
## Accuracy : 0.9223
## 95% CI : (0.8996, 0.9412)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8446
##
## Mcnemar's Test P-Value : 0.7835
##
## Sensitivity : 0.9179
## Specificity : 0.9267
## Pos Pred Value : 0.9260
## Neg Pred Value : 0.9186
## Prevalence : 0.5000
## Detection Rate : 0.4589
## Detection Prevalence : 0.4956
## Balanced Accuracy : 0.9223
##
## 'Positive' Class : 0
##
Confusion Matrix
Now we can compare the model’s accuracy on the training dataset versus the test dataset.
As expected, the accuracy on the test dataset is slightly lower than on the training dataset. This is probably due to a small amount of overfitting to the training data.
# Model's accuracy - Training dataset Vs using the Test dataset.
CM_Train <- table(ActualValue=ModelTrain$result, PredictedValue=Pred_Train > 0.5)
AC_Train <- sum(diag(CM_Train)/sum(CM_Train))
 | Accuracy.in.train | Accuracy.in.test |
---|---|---|
Accuracy | 0.9269006 | 0.9222874 |
We can draw a graph that shows the predicted probabilities for a team to win a game, along with their actual result. Most of the teams that won the game (turquoise) are predicted to have a high probability of winning, and most of the teams that lost the game (salmon) are predicted to have a low winning probability.
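A minimal sketch of how such a plot can be drawn with ggplot2, using the Pred_Train probabilities computed earlier (the colors are an assumption, chosen to match the description above):
# Rank the fitted probabilities and color them by the observed result
library(ggplot2)
plot.data <- data.frame(probability = Pred_Train, result = ModelTrain$result)
plot.data <- plot.data[order(plot.data$probability), ]
plot.data$rank <- 1:nrow(plot.data)
ggplot(plot.data, aes(x = rank, y = probability, color = result)) +
  geom_point(alpha = 0.6, shape = 4) +
  scale_color_manual(values = c("Defeat" = "salmon", "Victory" = "turquoise")) +
  labs(x = "Games sorted by predicted probability", y = "Predicted probability of winning")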
This means that the logistic regression has done a pretty good job. However, we can use cross-validation to get a better idea of how well it might perform on new data.
With this technique, the data set is repeatedly segmented into training and validation samples, and we check whether the results of the analysis are independent of the particular partition; in other words, it validates how well our model would perform with new data.
# Cross Validation Control
fitControl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
# Model Fit
logistic_CV <- train(result ~ side + elementalsd + elderd + barond + gspd + wardratio, data=ModelTrain,
method="glm", family="binomial", trControl = fitControl)
# Testing Model Fit with the Test Dataset
Pred_Test_CV = predict(logistic_CV, ModelTest, type="prob")[,2]
# Creating Confusion Matrix of logistic_CV
Model.test.observed <- as.factor(ifelse(ModelTest$result == "Victory", 1, 0))
Model.test.predt <- function(k) ifelse(Pred_Test_CV > k, 1,0)
CM_Test_CV <- confusionMatrix(as.factor(Model.test.predt(0.5)),Model.test.observed)
AC_Test_CV <- CM_Test_CV$overall[1]
CM_Test_CV
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 313 25
## 1 28 316
##
## Accuracy : 0.9223
## 95% CI : (0.8996, 0.9412)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8446
##
## Mcnemar's Test P-Value : 0.7835
##
## Sensitivity : 0.9179
## Specificity : 0.9267
## Pos Pred Value : 0.9260
## Neg Pred Value : 0.9186
## Prevalence : 0.5000
## Detection Rate : 0.4589
## Detection Prevalence : 0.4956
## Balanced Accuracy : 0.9223
##
## 'Positive' Class : 0
##
Comparison between the logistic model accuracies with and without cross-validation:
 | Logistic.without.CV | Logistic.with.CV |
---|---|---|
Accuracy | 0.9222874 | 0.9222874 |
We now fit a Decision Trees model. The output is stored in a variable called DT_Model.
# Decision Trees Model
DT_Model <- train(result ~ side + elementalsd + elderd + barond + gspd + wardratio, data=ModelTrain, method="rpart", trControl=fitControl)
# Testing the model
DT_Predict <- predict(DT_Model,newdata=ModelTest)
CM_DT_Test <- confusionMatrix(ModelTest$result,DT_Predict)
AC_DT_Test <- CM_DT_Test$overall[1]
# Display confusion matrix and model accuracy
CM_DT_Test
## Confusion Matrix and Statistics
##
## Reference
## Prediction Defeat Victory
## Defeat 302 39
## Victory 42 299
##
## Accuracy : 0.8812
## 95% CI : (0.8546, 0.9046)
## No Information Rate : 0.5044
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7625
##
## Mcnemar's Test P-Value : 0.8241
##
## Sensitivity : 0.8779
## Specificity : 0.8846
## Pos Pred Value : 0.8856
## Neg Pred Value : 0.8768
## Prevalence : 0.5044
## Detection Rate : 0.4428
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8813
##
## 'Positive' Class : Defeat
##
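As a side note, the tree that caret selected can be plotted; a sketch assuming the rpart.plot package is installed (DT_Model$finalModel is the underlying rpart object, so the splits appear on the dummy-coded variable names):
# Visualize the final classification tree
library(rpart.plot)
rpart.plot(DT_Model$finalModel)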
Lastly, we fit a Random Forest model. The output is stored in a variable called RF_Model.
# Random Forests Model
RF_Model <- train(result ~ side + elementalsd + elderd + barond + gspd + wardratio, data=ModelTrain, method="rf", trControl=fitControl, verbose=FALSE)
# Plot
plot(RF_Model,main="RF Model Accuracy by number of predictors")
# Testing the model
RF_Predict <- predict(RF_Model,newdata=ModelTest)
CM_RF_Test <- confusionMatrix(ModelTest$result,RF_Predict)
AC_RF_Test <- CM_RF_Test$overall[1]
# Display confusion matrix and model accuracy
CM_RF_Test
## Confusion Matrix and Statistics
##
## Reference
## Prediction Defeat Victory
## Defeat 312 29
## Victory 25 316
##
## Accuracy : 0.9208
## 95% CI : (0.8979, 0.94)
## No Information Rate : 0.5059
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8416
##
## Mcnemar's Test P-Value : 0.6831
##
## Sensitivity : 0.9258
## Specificity : 0.9159
## Pos Pred Value : 0.9150
## Neg Pred Value : 0.9267
## Prevalence : 0.4941
## Detection Rate : 0.4575
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9209
##
## 'Positive' Class : Defeat
##
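The random forest’s variable importance can also be inspected through caret, which makes for a useful cross-check against the logistic model’s importance table above; a minimal sketch:
# Variable importance for the random forest (scaled to 0-100 by caret's default)
plot(varImp(RF_Model), main = "RF Model - Variable Importance")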
 | Logistic.Model | DecisionTrees.Model | RandomForest.Model |
---|---|---|---|
Accuracy | 0.9222874 | 0.8812317 | 0.9208211 |
After comparing the accuracy of the three models, the logistic regression model (logistic) comes out on top, so we will use it to predict the winning teams. That said, the random forest model is very close behind, and the decision tree model also reaches a good accuracy.
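As an illustration of how the final model would be used, we can score a hypothetical game; the feature values below are made up for the example and are not taken from the dataset:
# A hypothetical game: red side, even Elemental Dragons, +1 Baron, small gold lead
new_game <- data.frame(
  side        = factor("Red", levels = levels(ModelTrain$side)),
  elementalsd = factor("[-3,3]", levels = levels(ModelTrain$elementalsd)),
  elderd      = factor("0", levels = levels(ModelTrain$elderd)),
  barond      = factor("1", levels = levels(ModelTrain$barond)),
  gspd        = 0.05,
  wardratio   = 2.3
)
# Predicted probability of victory under the logistic model
predict(logistic, newdata = new_game, type = "response")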