Hello all!
I’m starting to document my explorations into data that I find interesting.
Today, we take a look at a Kaggle dataset. According to the description, it contains the first 10 minutes' worth of data from League of Legends Diamond-ranked games.
My goal is to look at various metrics during the first 10 minutes of gameplay in LoL Diamond-ranked games and then find the strongest predictors of winning.
First things first, the packages. Today I’ll be working with tidyverse, janitor, gtsummary and a few other packages that I’ll go over later.
library(tidyverse)
library(janitor)
library(gtsummary)
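One thing I haven't shown is loading the data itself. Here's a minimal sketch, assuming the CSV from Kaggle is saved in the working directory under its default name:
lol <- read_csv("high_diamond_ranked_10min.csv") # filename as downloaded from Kaggle; adjust the path if yours differs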
One thing that can be very helpful to quickly understand your data is to look at summary statistics and visualize distributions.
Before we jump in, let’s clean up the data.
lol <- lol %>%
  clean_names() %>% # convert column names to snake_case
  select(-game_id) %>% # remove game_id (not useful for my purposes)
  mutate(blue_wins = fct_recode(as.factor(blue_wins), # recode 0/1 into readable labels
                                "Red Wins" = "0",
                                "Blue Wins" = "1"))
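A quick sanity check that the recode took (these counts match the group sizes in the summary table below):
table(lol$blue_wins)
##  Red Wins Blue Wins
##      4949      4930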
I initially wanted to build a table using base R's summary() function; however, with this many variables the output quickly becomes overwhelming.
The gtsummary package is a nice tool for creating summary tables. I won't do justice to the full functionality of this package here, but it is quite useful and easy to implement.
set_gtsummary_theme(theme_gtsummary_compact()) # set the compact theme before building the table
lol_summary <- lol %>% tbl_summary(by = blue_wins) %>% # create summary table by group
  modify_header(label = "**Variable**") %>% # change header label
  bold_labels() # bold labels
lol_summary
| Variable | Red Wins, N = 4,949¹ | Blue Wins, N = 4,930¹ |
|---|---|---|
| blue_wards_placed | 16 (14, 20) | 17 (15, 20) |
| blue_wards_destroyed | 2.00 (1.00, 4.00) | 3.00 (2.00, 4.00) |
| blue_first_blood | 2000 (40%) | 2987 (61%) |
| blue_kills | 5.0 (3.0, 7.0) | 7.0 (5.0, 9.0) |
| blue_deaths | 7.00 (5.00, 9.00) | 5.00 (3.00, 7.00) |
| blue_assists | 5.0 (3.0, 8.0) | 7.0 (5.0, 10.0) |
| blue_elite_monsters | ||
| 0 | 3101 (63%) | 2055 (42%) |
| 1 | 1660 (34%) | 2353 (48%) |
| 2 | 188 (3.8%) | 522 (11%) |
| blue_dragons | 1284 (26%) | 2292 (46%) |
| blue_heralds | 752 (15%) | 1105 (22%) |
| blue_towers_destroyed | ||
| 0 | 4835 (98%) | 4580 (93%) |
| 1 | 113 (2.3%) | 316 (6.4%) |
| 2 | 1 (<0.1%) | 26 (0.5%) |
| 3 | 0 (0%) | 7 (0.1%) |
| 4 | 0 (0%) | 1 (<0.1%) |
| blue_total_gold | 15791 (14961, 16706) | 17030 (16103, 18050) |
| blue_avg_level | 6.80 (6.60, 7.00) | 7.00 (6.80, 7.20) |
| blue_total_experience | 17511 (16730, 18228) | 18408 (17700, 19145) |
| blue_total_minions_killed | 213 (198, 227) | 222 (208, 236) |
| blue_total_jungle_minions_killed | 48 (43, 56) | 52 (44, 59) |
| blue_gold_diff | -1157 (-2568, 166) | 1204 (-146, 2576) |
| blue_experience_diff | -934 (-2028, 147) | 866 (-214, 1987) |
| blue_cs_per_min | 21.30 (19.80, 22.70) | 22.25 (20.80, 23.60) |
| blue_gold_per_min | 1579 (1496, 1671) | 1703 (1610, 1805) |
| red_wards_placed | 17 (15, 20) | 16 (14, 19) |
| red_wards_destroyed | 3.00 (1.00, 4.00) | 2.00 (1.00, 3.00) |
| red_first_blood | 2949 (60%) | 1943 (39%) |
| red_kills | 7.00 (5.00, 9.00) | 5.00 (3.00, 7.00) |
| red_deaths | 5.0 (3.0, 7.0) | 7.0 (5.0, 9.0) |
| red_assists | 7.0 (5.0, 10.0) | 5.0 (3.0, 8.0) |
| red_elite_monsters | ||
| 0 | 1948 (39%) | 2999 (61%) |
| 1 | 2480 (50%) | 1722 (35%) |
| 2 | 521 (11%) | 209 (4.2%) |
| red_dragons | 2554 (52%) | 1527 (31%) |
| red_heralds | 968 (20%) | 613 (12%) |
| red_towers_destroyed | ||
| 0 | 4647 (94%) | 4836 (98%) |
| 1 | 280 (5.7%) | 87 (1.8%) |
| 2 | 22 (0.4%) | 7 (0.1%) |
| red_total_gold | 16977 (16082, 18027) | 15786 (14963, 16688) |
| red_avg_level | 7.00 (6.80, 7.20) | 6.80 (6.60, 7.00) |
| red_total_experience | 18421 (17722, 19154) | 17542 (16805, 18251) |
| red_total_minions_killed | 223 (209, 236) | 213 (198, 228) |
| red_total_jungle_minions_killed | 52 (45, 59) | 50 (44, 56) |
| red_gold_diff | 1157 (-166, 2568) | -1204 (-2576, 146) |
| red_experience_diff | 934 (-147, 2028) | -866 (-1987, 214) |
| red_cs_per_min | 22.30 (20.90, 23.60) | 21.30 (19.80, 22.80) |
| red_gold_per_min | 1698 (1608, 1803) | 1579 (1496, 1669) |
¹ Statistics presented: median (IQR); n (%)
Sometimes it can be helpful to visualize the data to get a better understanding of what’s going on.
This is a wonderful solution from Dr. Simon Jackson for visualizing the distribution of every numeric variable at once.
The solution uses the purrr, tidyr, and ggplot2 packages (loading tidyverse loads all three). I also loaded the janitor package earlier to clean my variable names.
lol %>%
  keep(is.numeric) %>% # purrr: keep only the columns for which is.numeric() returns TRUE
  gather() %>% # tidyr: reshape into two columns, a key (variable name) and a value
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free", ncol = 4, nrow = 10) + # one histogram panel per variable
  geom_histogram(fill = "dodgerblue1", alpha = .5) +
  theme(strip.background = element_blank(),
        strip.placement = "outside",
        axis.text.x = element_text(angle = 35))
As you can see from the plots, we have a range of distributions.
But so what? OK, I admit that visualizing the distributions here is not as helpful as I would like. How about we try something different?
Remember, my goal here is to find variables that predict success (or wins).
A very simple way to identify variables that predict when the blue team wins is a stepwise regression. Rather than fitting every possible combination of variables, a stepwise regression iteratively adds and removes predictors, at each step keeping the change that most reduces the AIC (Akaike Information Criterion). AIC takes into consideration both model fit and parsimony: when comparing models, it rewards predictive accuracy but charges a penalty for every additional parameter. Thus, the final model may exclude some variables that carry a little predictive power, just not enough to justify the added complexity.
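To make the AIC trade-off concrete, here's a tiny illustration on a built-in dataset (mtcars, not the LoL data): AIC equals 2k minus twice the log-likelihood, where k is the number of estimated parameters.
m <- glm(am ~ mpg + wt, family = binomial, data = mtcars) # toy logistic model
k <- length(coef(m)) # number of estimated parameters
2 * k - 2 * as.numeric(logLik(m)) # AIC computed by hand
AIC(m) # matches the built-in function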
For the stepwise regression, I’m using an example from Akanksha Rawat. We will be using the MASS and pROC packages.
library(MASS) # careful: MASS masks dplyr::select(), so load it after any select() calls
library(pROC)
Next, build a logistic regression model predicting blue_wins from every other variable in the data (that's what the . in the formula does).
model1 <- glm(blue_wins ~ ., family = binomial, data = lol)
summary(model1)
##
## Call:
## glm(formula = blue_wins ~ ., family = binomial, data = lol)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7067 -0.8735 -0.1414 0.8671 2.7618
##
## Coefficients: (13 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.843e-01 1.242e+00 -0.310 0.756928
## blue_wards_placed -2.050e-03 1.340e-03 -1.530 0.126014
## blue_wards_destroyed 8.186e-04 1.145e-02 0.071 0.943031
## blue_first_blood 6.959e-02 5.249e-02 1.326 0.184901
## blue_kills -3.209e-02 3.057e-02 -1.050 0.293819
## blue_deaths 2.801e-02 3.064e-02 0.914 0.360642
## blue_assists -1.509e-02 1.163e-02 -1.298 0.194401
## blue_elite_monsters 4.056e-02 6.508e-02 0.623 0.533144
## blue_dragons 3.180e-01 9.354e-02 3.400 0.000674 ***
## blue_heralds NA NA NA NA
## blue_towers_destroyed -2.539e-01 1.452e-01 -1.749 0.080215 .
## blue_total_gold 2.742e-05 9.902e-05 0.277 0.781818
## blue_avg_level 1.021e-02 1.826e-01 0.056 0.955396
## blue_total_experience -3.747e-05 8.109e-05 -0.462 0.644047
## blue_total_minions_killed -4.206e-03 1.946e-03 -2.162 0.030639 *
## blue_total_jungle_minions_killed 2.729e-03 3.333e-03 0.819 0.412777
## blue_gold_diff 5.012e-04 6.945e-05 7.217 5.33e-13 ***
## blue_experience_diff 2.655e-04 6.062e-05 4.379 1.19e-05 ***
## blue_cs_per_min NA NA NA NA
## blue_gold_per_min NA NA NA NA
## red_wards_placed -1.541e-03 1.314e-03 -1.173 0.240902
## red_wards_destroyed -2.635e-03 1.149e-02 -0.229 0.818655
## red_first_blood NA NA NA NA
## red_kills NA NA NA NA
## red_deaths NA NA NA NA
## red_assists 1.488e-02 1.136e-02 1.310 0.190163
## red_elite_monsters -8.361e-02 6.966e-02 -1.200 0.230042
## red_dragons -2.045e-01 9.669e-02 -2.115 0.034423 *
## red_heralds NA NA NA NA
## red_towers_destroyed 3.903e-01 1.523e-01 2.562 0.010394 *
## red_total_gold NA NA NA NA
## red_avg_level -1.738e-02 1.828e-01 -0.095 0.924228
## red_total_experience NA NA NA NA
## red_total_minions_killed 5.409e-03 1.924e-03 2.812 0.004925 **
## red_total_jungle_minions_killed 6.230e-03 3.326e-03 1.873 0.061049 .
## red_gold_diff NA NA NA NA
## red_experience_diff NA NA NA NA
## red_cs_per_min NA NA NA NA
## red_gold_per_min NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13695 on 9878 degrees of freedom
## Residual deviance: 10411 on 9853 degrees of freedom
## AIC: 10463
##
## Number of Fisher Scoring iterations: 4
The results from model1 tell us we have a few problems. First, we are seeing NAs because of singularities: a coefficient comes back NA when that variable is an exact linear combination of other predictors, so the model has no unique way to estimate it. This dataset is full of such redundancies. For example, red_gold_diff is just the negative of blue_gold_diff, blue_cs_per_min is blue_total_minions_killed divided by 10 (these are the first 10 minutes of each game), red_kills is identical to blue_deaths, and blue_heralds is blue_elite_monsters minus blue_dragons. Separately, some of the surviving predictors are highly (though not perfectly) correlated; when you assess the correlation between blue_gold_diff and blue_experience_diff, for instance, you find r = .89. That kind of multicollinearity doesn't produce NAs, but it does inflate standard errors and makes individual coefficients hard to interpret.
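If you want to verify a couple of these dependencies yourself, here's a quick sketch (using the cleaned column names from above; near() is dplyr's tolerant floating-point comparison):
all(near(lol$red_gold_diff, -lol$blue_gold_diff)) # expect TRUE: exact mirror of blue_gold_diff
all(near(lol$blue_cs_per_min, lol$blue_total_minions_killed / 10)) # expect TRUE: CS per minute over a 10-minute window
cor(lol$blue_gold_diff, lol$blue_experience_diff) # the r = .89 mentioned above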
For the purposes of this exploration, I’m not going to dive into the singularity issue.
The next thing to do is perform the stepwise regression.
step1 <- stepAIC(model1, trace = FALSE) # trace = FALSE suppresses the step-by-step output
NOTE: I suppressed the step-by-step results from the stepAIC function because they show every model combination evaluated (it's a lot of output).
summary(step1)
##
## Call:
## glm(formula = blue_wins ~ blue_wards_placed + blue_first_blood +
## blue_kills + blue_deaths + blue_dragons + blue_towers_destroyed +
## blue_total_minions_killed + blue_gold_diff + blue_experience_diff +
## red_elite_monsters + red_dragons + red_towers_destroyed +
## red_total_minions_killed + red_total_jungle_minions_killed,
## family = binomial, data = lol)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6989 -0.8709 -0.1405 0.8691 2.7706
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.469e-01 4.564e-01 -0.979 0.327449
## blue_wards_placed -2.189e-03 1.326e-03 -1.651 0.098772 .
## blue_first_blood 8.204e-02 5.216e-02 1.573 0.115757
## blue_kills -4.729e-02 2.019e-02 -2.343 0.019153 *
## blue_deaths 4.748e-02 2.119e-02 2.240 0.025068 *
## blue_dragons 3.663e-01 6.438e-02 5.689 1.27e-08 ***
## blue_towers_destroyed -1.940e-01 1.298e-01 -1.495 0.135009
## blue_total_minions_killed -4.314e-03 1.575e-03 -2.740 0.006151 **
## blue_gold_diff 4.951e-04 4.449e-05 11.128 < 2e-16 ***
## blue_experience_diff 2.767e-04 3.200e-05 8.647 < 2e-16 ***
## red_elite_monsters -1.023e-01 6.801e-02 -1.504 0.132556
## red_dragons -1.775e-01 9.427e-02 -1.883 0.059760 .
## red_towers_destroyed 3.908e-01 1.397e-01 2.798 0.005145 **
## red_total_minions_killed 5.043e-03 1.524e-03 3.308 0.000939 ***
## red_total_jungle_minions_killed 5.583e-03 2.884e-03 1.936 0.052856 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13695 on 9878 degrees of freedom
## Residual deviance: 10417 on 9864 degrees of freedom
## AIC: 10447
##
## Number of Fisher Scoring iterations: 4
We can see from the results that the stepAIC function settled on a model with 14 predictors of blue_wins.
Within the summary of that model, four variables stand out as highly predictive (p < .001) of blue_wins: blue_dragons, blue_gold_diff, blue_experience_diff, and red_total_minions_killed.
So let's build a model with just those four and get in some visualizations.
model2 <- glm(blue_wins ~ blue_dragons + blue_gold_diff + blue_experience_diff + red_total_minions_killed, family = binomial, data = lol)
summary(model2)
##
## Call:
## glm(formula = blue_wins ~ blue_dragons + blue_gold_diff + blue_experience_diff +
## red_total_minions_killed, family = binomial, data = lol)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7518 -0.8788 -0.1381 0.8764 2.7186
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.834e-01 2.698e-01 -2.904 0.00369 **
## blue_dragons 5.366e-01 5.011e-02 10.709 < 2e-16 ***
## blue_gold_diff 4.102e-04 2.334e-05 17.572 < 2e-16 ***
## blue_experience_diff 2.488e-04 2.854e-05 8.715 < 2e-16 ***
## red_total_minions_killed 2.701e-03 1.234e-03 2.189 0.02863 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13695 on 9878 degrees of freedom
## Residual deviance: 10457 on 9874 degrees of freedom
## AIC: 10467
##
## Number of Fisher Scoring iterations: 4
Looking at the summary of model2, we see that red_total_minions_killed is no longer as highly predictive of blue_wins as it was when the other variables were included in the stepwise model. One probable reason is that dropping predictors changes the covariance structure of the model: a variable's coefficient, and its apparent significance, depends on which other variables it is being adjusted for.
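You can see this shift directly by comparing the coefficient across the two fits:
coef(step1)["red_total_minions_killed"] # 5.043e-03 when adjusted for 13 other predictors
coef(model2)["red_total_minions_killed"] # 2.701e-03 when adjusted for only 3 other predictors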
What we can conclude from this basic look is that the number of dragons killed by the blue team (blue_dragons), the difference in gold between the blue and red teams (blue_gold_diff), the difference in experience between the blue and red teams (blue_experience_diff), and the total number of minions killed by the red team (red_total_minions_killed) were all important factors in determining whether the blue team won.
A ROC curve is a nice method for assessing your model's performance. The end goal of this visualization is to extract the AUC (area under the curve), which indicates how well the model discriminates between outcomes. Let's go!
m2_roc <- roc(blue_wins ~ model2$fitted.values, data = lol, plot = TRUE, main = "LOL ROC Curve", col = "blue") # ROC curve from model2's fitted probabilities
m2_roc
##
## Call:
## roc.formula(formula = blue_wins ~ model2$fitted.values, data = lol, plot = TRUE, main = "LOL ROC Curve", col = "blue")
##
## Data: model2$fitted.values in 4949 controls (blue_wins Red Wins) < 4930 cases (blue_wins Blue Wins).
## Area under the curve: 0.8103
The AUC tells us that our ability to distinguish between the classes (here, a blue win versus a red win) based on the specified predictors is about .81.
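If you want the AUC as a standalone value (to store or compare later), pROC's auc() extracts it from the roc object:
auc(m2_roc)
## Area under the curve: 0.8103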
Keep in mind, I used a simplified model rather than the full model selected by the stepwise regression. That fuller model might well achieve a higher AUC and better predictive ability. But again, for the sake of parsimony and time, I chose to look only at the variables that seemed to have the greatest predictive power.
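If you're curious, the same machinery works on the full stepwise model; I haven't run it here, but the pattern would be:
step1_roc <- roc(lol$blue_wins, step1$fitted.values) # ROC for the full stepwise model
auc(step1_roc) # result not shown in this post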
I hope you have found this exploration helpful!