Hello all!
I’m starting to document my explorations into data that I find interesting.
Today, we take a look at a Kaggle dataset. According to the description, it contains the first 10 minutes' worth of data from League of Legends Diamond-ranked games.
My goal is to look at various metrics during the first 10 minutes of gameplay in LoL Diamond-ranked games and then find the strongest predictors of winning.
First things first, the packages. Today I’ll be working with tidyverse, janitor, gtsummary and a few other packages that I’ll go over later.
library(tidyverse)
library(janitor)
library(gtsummary)
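One thing I haven't shown is loading the data itself. Here's a minimal sketch, assuming the CSV from Kaggle is saved in the working directory under its default name:
lol <- read_csv("high_diamond_ranked_10min.csv") # filename as downloaded from Kaggle; adjust the path if yours differs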
One thing that can be very helpful to quickly understand your data is to look at summary statistics and visualize distributions.
Before we jump in, let’s clean up the data.
lol <- lol %>%
  clean_names() %>% # convert column names to snake_case
  select(-game_id) %>% # remove game_id (not useful for my purposes)
  mutate(blue_wins = fct_recode(as.factor(blue_wins), # recode 0/1 into readable labels
                                "Red Wins" = "0",
                                "Blue Wins" = "1"))
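A quick sanity check that the recode took (these counts match the group sizes in the summary table below):
table(lol$blue_wins)
##  Red Wins Blue Wins
##      4949      4930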
I initially wanted to build a table using base R's summary() function; however, with this many variables the output quickly becomes overwhelming.
The gtsummary package is a nice tool for creating summary tables. I won't do justice to the full functionality of this package here, but it is quite useful and easy to implement.
set_gtsummary_theme(theme_gtsummary_compact()) # set the compact theme before building the table
lol_summary <- lol %>% tbl_summary(by = blue_wins) %>% # create summary table by group
  modify_header(label = "**Variable**") %>% # change header label
  bold_labels() # bold labels
lol_summary
| Variable | Red Wins, N = 4,949¹ | Blue Wins, N = 4,930¹ |
|---|---|---|
| blue_wards_placed | 16 (14, 20) | 17 (15, 20) |
| blue_wards_destroyed | 2.00 (1.00, 4.00) | 3.00 (2.00, 4.00) |
| blue_first_blood | 2000 (40%) | 2987 (61%) |
| blue_kills | 5.0 (3.0, 7.0) | 7.0 (5.0, 9.0) |
| blue_deaths | 7.00 (5.00, 9.00) | 5.00 (3.00, 7.00) |
| blue_assists | 5.0 (3.0, 8.0) | 7.0 (5.0, 10.0) |
| blue_elite_monsters | ||
| 0 | 3101 (63%) | 2055 (42%) |
| 1 | 1660 (34%) | 2353 (48%) |
| 2 | 188 (3.8%) | 522 (11%) |
| blue_dragons | 1284 (26%) | 2292 (46%) |
| blue_heralds | 752 (15%) | 1105 (22%) |
| blue_towers_destroyed | ||
| 0 | 4835 (98%) | 4580 (93%) |
| 1 | 113 (2.3%) | 316 (6.4%) |
| 2 | 1 (<0.1%) | 26 (0.5%) |
| 3 | 0 (0%) | 7 (0.1%) |
| 4 | 0 (0%) | 1 (<0.1%) |
| blue_total_gold | 15791 (14961, 16706) | 17030 (16103, 18050) |
| blue_avg_level | 6.80 (6.60, 7.00) | 7.00 (6.80, 7.20) |
| blue_total_experience | 17511 (16730, 18228) | 18408 (17700, 19145) |
| blue_total_minions_killed | 213 (198, 227) | 222 (208, 236) |
| blue_total_jungle_minions_killed | 48 (43, 56) | 52 (44, 59) |
| blue_gold_diff | -1157 (-2568, 166) | 1204 (-146, 2576) |
| blue_experience_diff | -934 (-2028, 147) | 866 (-214, 1987) |
| blue_cs_per_min | 21.30 (19.80, 22.70) | 22.25 (20.80, 23.60) |
| blue_gold_per_min | 1579 (1496, 1671) | 1703 (1610, 1805) |
| red_wards_placed | 17 (15, 20) | 16 (14, 19) |
| red_wards_destroyed | 3.00 (1.00, 4.00) | 2.00 (1.00, 3.00) |
| red_first_blood | 2949 (60%) | 1943 (39%) |
| red_kills | 7.00 (5.00, 9.00) | 5.00 (3.00, 7.00) |
| red_deaths | 5.0 (3.0, 7.0) | 7.0 (5.0, 9.0) |
| red_assists | 7.0 (5.0, 10.0) | 5.0 (3.0, 8.0) |
| red_elite_monsters | ||
| 0 | 1948 (39%) | 2999 (61%) |
| 1 | 2480 (50%) | 1722 (35%) |
| 2 | 521 (11%) | 209 (4.2%) |
| red_dragons | 2554 (52%) | 1527 (31%) |
| red_heralds | 968 (20%) | 613 (12%) |
| red_towers_destroyed | ||
| 0 | 4647 (94%) | 4836 (98%) |
| 1 | 280 (5.7%) | 87 (1.8%) |
| 2 | 22 (0.4%) | 7 (0.1%) |
| red_total_gold | 16977 (16082, 18027) | 15786 (14963, 16688) |
| red_avg_level | 7.00 (6.80, 7.20) | 6.80 (6.60, 7.00) |
| red_total_experience | 18421 (17722, 19154) | 17542 (16805, 18251) |
| red_total_minions_killed | 223 (209, 236) | 213 (198, 228) |
| red_total_jungle_minions_killed | 52 (45, 59) | 50 (44, 56) |
| red_gold_diff | 1157 (-166, 2568) | -1204 (-2576, 146) |
| red_experience_diff | 934 (-147, 2028) | -866 (-1987, 214) |
| red_cs_per_min | 22.30 (20.90, 23.60) | 21.30 (19.80, 22.80) |
| red_gold_per_min | 1698 (1608, 1803) | 1579 (1496, 1669) |
¹ Statistics presented: median (IQR); n (%)
Sometimes it can be helpful to visualize the data to get a better understanding of what’s going on.
This is a wonderful solution from Dr. Simon Jackson for visualizing the distribution of every numeric variable at once.
The solution uses the purrr, tidyr, and ggplot2 packages (loading tidyverse loads all three). I also loaded the janitor package earlier to clean my variable names.
lol %>%
  keep(is.numeric) %>% # purrr: keep only the columns for which is.numeric() returns TRUE
  gather() %>% # tidyr: reshape into two columns, a key (variable name) and a value
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free", ncol = 4, nrow = 10) + # one histogram panel per variable
  geom_histogram(fill = "dodgerblue1", alpha = .5) +
  theme(strip.background = element_blank(),
        strip.placement = "outside",
        axis.text.x = element_text(angle = 35))
As you can see from the plots, we have a range of distributions.
But so what? OK, I admit that visualizing the distributions here is not as helpful as I would like. How about we try something different?
Remember, my goal here is to find variables that predict success (or wins).
A very simple way to identify variables that predict when the blue team wins is a stepwise regression. Rather than fitting every possible combination of variables, a stepwise regression iteratively adds and removes predictors, at each step keeping the change that most reduces the AIC (Akaike Information Criterion). AIC takes into consideration both model fit and parsimony: when comparing models, it rewards predictive accuracy but charges a penalty for every additional parameter. Thus, the final model may exclude some variables that carry a little predictive power, just not enough to justify the added complexity.
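To make the AIC trade-off concrete, here's a tiny illustration on a built-in dataset (mtcars, not the LoL data): AIC equals 2k minus twice the log-likelihood, where k is the number of estimated parameters.
m <- glm(am ~ mpg + wt, family = binomial, data = mtcars) # toy logistic model
k <- length(coef(m)) # number of estimated parameters
2 * k - 2 * as.numeric(logLik(m)) # AIC computed by hand
AIC(m) # matches the built-in function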
For the stepwise regression, I’m using an example from Akanksha Rawat. We will be using the MASS and pROC packages.
library(MASS) # careful: MASS masks dplyr::select(), so load it after any select() calls
library(pROC)
Next, build a logistic regression model predicting blue_wins from every other variable in the data (that's what the . in the formula does).
model1 <- glm(blue_wins ~ ., family = binomial, data = lol)
summary(model1)
##
## Call:
## glm(formula = blue_wins ~ ., family = binomial, data = lol)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7067 -0.8735 -0.1414 0.8671 2.7618
##
## Coefficients: (13 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.843e-01 1.242e+00 -0.310 0.756928
## blue_wards_placed -2.050e-03 1.340e-03 -1.530 0.126014
## blue_wards_destroyed 8.186e-04 1.145e-02 0.071 0.943031
## blue_first_blood 6.959e-02 5.249e-02 1.326 0.184901
## blue_kills -3.209e-02 3.057e-02 -1.050 0.293819
## blue_deaths 2.801e-02 3.064e-02 0.914 0.360642
## blue_assists -1.509e-02 1.163e-02 -1.298 0.194401
## blue_elite_monsters 4.056e-02 6.508e-02 0.623 0.533144
## blue_dragons 3.180e-01 9.354e-02 3.400 0.000674 ***
## blue_heralds NA NA NA NA
## blue_towers_destroyed -2.539e-01 1.452e-01 -1.749 0.080215 .
## blue_total_gold 2.742e-05 9.902e-05 0.277 0.781818
## blue_avg_level 1.021e-02 1.826e-01 0.056 0.955396
## blue_total_experience -3.747e-05 8.109e-05 -0.462 0.644047
## blue_total_minions_killed -4.206e-03 1.946e-03 -2.162 0.030639 *
## blue_total_jungle_minions_killed 2.729e-03 3.333e-03 0.819 0.412777
## blue_gold_diff 5.012e-04 6.945e-05 7.217 5.33e-13 ***
## blue_experience_diff 2.655e-04 6.062e-05 4.379 1.19e-05 ***
## blue_cs_per_min NA NA NA NA
## blue_gold_per_min NA NA NA NA
## red_wards_placed -1.541e-03 1.314e-03 -1.173 0.240902
## red_wards_destroyed -2.635e-03 1.149e-02 -0.229 0.818655
## red_first_blood NA NA NA NA
## red_kills NA NA NA NA
## red_deaths NA NA NA NA
## red_assists 1.488e-02 1.136e-02 1.310 0.190163
## red_elite_monsters -8.361e-02 6.966e-02 -1.200 0.230042
## red_dragons -2.045e-01 9.669e-02 -2.115 0.034423 *
## red_heralds NA NA NA NA
## red_towers_destroyed 3.903e-01 1.523e-01 2.562 0.010394 *
## red_total_gold NA NA NA NA
## red_avg_level -1.738e-02 1.828e-01 -0.095 0.924228
## red_total_experience NA NA NA NA
## red_total_minions_killed 5.409e-03 1.924e-03 2.812 0.004925 **
## red_total_jungle_minions_killed 6.230e-03 3.326e-03 1.873 0.061049 .
## red_gold_diff NA NA NA NA
## red_experience_diff NA NA NA NA
## red_cs_per_min NA NA NA NA
## red_gold_per_min NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13695 on 9878 degrees of freedom
## Residual deviance: 10411 on 9853 degrees of freedom
## AIC: 10463
##
## Number of Fisher Scoring iterations: 4
The results from model1 tell us we have a few problems. First, we are seeing NAs because of singularities: a coefficient comes back NA when that variable is an exact linear combination of other predictors, so the model has no unique way to estimate it. This dataset is full of such redundancies. For example, red_gold_diff is just the negative of blue_gold_diff, blue_cs_per_min is blue_total_minions_killed divided by 10 (these are the first 10 minutes of each game), red_kills is identical to blue_deaths, and blue_heralds is blue_elite_monsters minus blue_dragons. Separately, some of the surviving predictors are highly (though not perfectly) correlated; when you assess the correlation between blue_gold_diff and blue_experience_diff, for instance, you find r = .89. That kind of multicollinearity doesn't produce NAs, but it does inflate standard errors and makes individual coefficients hard to interpret.
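If you want to verify a couple of these dependencies yourself, here's a quick sketch (using the cleaned column names from above; near() is dplyr's tolerant floating-point comparison):
all(near(lol$red_gold_diff, -lol$blue_gold_diff)) # expect TRUE: exact mirror of blue_gold_diff
all(near(lol$blue_cs_per_min, lol$blue_total_minions_killed / 10)) # expect TRUE: CS per minute over a 10-minute window
cor(lol$blue_gold_diff, lol$blue_experience_diff) # the r = .89 mentioned above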
For the purposes of this exploration, I’m not going to dive into the singularity issue.
The next thing to do is perform the stepwise regression.
step1 <- stepAIC(model1, trace = FALSE) # trace = FALSE suppresses the step-by-step output
NOTE: I suppressed the step-by-step results from the stepAIC function because they show every model combination evaluated (it's a lot of output).
summary(step1)
##
## Call:
## glm(formula = blue_wins ~ blue_wards_placed + blue_first_blood +
## blue_kills + blue_deaths + blue_dragons + blue_towers_destroyed +
## blue_total_minions_killed + blue_gold_diff + blue_experience_diff +
## red_elite_monsters + red_dragons + red_towers_destroyed +
## red_total_minions_killed + red_total_jungle_minions_killed,
## family = binomial, data = lol)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6989 -0.8709 -0.1405 0.8691 2.7706
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.469e-01 4.564e-01 -0.979 0.327449
## blue_wards_placed -2.189e-03 1.326e-03 -1.651 0.098772 .
## blue_first_blood 8.204e-02 5.216e-02 1.573 0.115757
## blue_kills -4.729e-02 2.019e-02 -2.343 0.019153 *
## blue_deaths 4.748e-02 2.119e-02 2.240 0.025068 *
## blue_dragons 3.663e-01 6.438e-02 5.689 1.27e-08 ***
## blue_towers_destroyed -1.940e-01 1.298e-01 -1.495 0.135009
## blue_total_minions_killed -4.314e-03 1.575e-03 -2.740 0.006151 **
## blue_gold_diff 4.951e-04 4.449e-05 11.128 < 2e-16 ***
## blue_experience_diff 2.767e-04 3.200e-05 8.647 < 2e-16 ***
## red_elite_monsters -1.023e-01 6.801e-02 -1.504 0.132556
## red_dragons -1.775e-01 9.427e-02 -1.883 0.059760 .
## red_towers_destroyed 3.908e-01 1.397e-01 2.798 0.005145 **
## red_total_minions_killed 5.043e-03 1.524e-03 3.308 0.000939 ***
## red_total_jungle_minions_killed 5.583e-03 2.884e-03 1.936 0.052856 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13695 on 9878 degrees of freedom
## Residual deviance: 10417 on 9864 degrees of freedom
## AIC: 10447
##
## Number of Fisher Scoring iterations: 4
We can see from the results that the stepAIC function settled on a model with 14 predictors of blue_wins.
Within the summary of that model, four variables stand out as highly predictive (p < .001) of blue_wins: blue_dragons, blue_gold_diff, blue_experience_diff, and red_total_minions_killed.
So let's build a model with just those four and get in some visualizations.
model2 <- glm(blue_wins ~ blue_dragons + blue_gold_diff + blue_experience_diff + red_total_minions_killed, family = binomial, data = lol)
summary(model2)
##
## Call:
## glm(formula = blue_wins ~ blue_dragons + blue_gold_diff + blue_experience_diff +
## red_total_minions_killed, family = binomial, data = lol)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7518 -0.8788 -0.1381 0.8764 2.7186
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.834e-01 2.698e-01 -2.904 0.00369 **
## blue_dragons 5.366e-01 5.011e-02 10.709 < 2e-16 ***
## blue_gold_diff 4.102e-04 2.334e-05 17.572 < 2e-16 ***
## blue_experience_diff 2.488e-04 2.854e-05 8.715 < 2e-16 ***
## red_total_minions_killed 2.701e-03 1.234e-03 2.189 0.02863 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13695 on 9878 degrees of freedom
## Residual deviance: 10457 on 9874 degrees of freedom
## AIC: 10467
##
## Number of Fisher Scoring iterations: 4
Looking at the summary of model2, we see that red_total_minions_killed is no longer as highly predictive of blue_wins as it was when the other variables were included in the stepwise model. One probable reason is that dropping predictors changes the covariance structure of the model: a variable's coefficient, and its apparent significance, depends on which other variables it is being adjusted for.
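You can see this shift directly by comparing the coefficient across the two fits:
coef(step1)["red_total_minions_killed"] # 5.043e-03 when adjusted for 13 other predictors
coef(model2)["red_total_minions_killed"] # 2.701e-03 when adjusted for only 3 other predictors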
What we can conclude from this basic look is that the number of dragons killed by the blue team (blue_dragons), the difference in gold between the blue and red teams (blue_gold_diff), the difference in experience between the blue and red teams (blue_experience_diff), and the total number of minions killed by the red team (red_total_minions_killed) were all important factors in determining whether the blue team won.
A ROC curve is a nice method for assessing your model's performance. The end goal of this visualization is to extract the AUC (area under the curve), which indicates how well the model discriminates between outcomes. Let's go!
m2_roc <- roc(blue_wins ~ model2$fitted.values, data = lol, plot = TRUE, main = "LOL ROC Curve", col = "blue") # ROC curve from model2's fitted probabilities
m2_roc
##
## Call:
## roc.formula(formula = blue_wins ~ model2$fitted.values, data = lol, plot = TRUE, main = "LOL ROC Curve", col = "blue")
##
## Data: model2$fitted.values in 4949 controls (blue_wins Red Wins) < 4930 cases (blue_wins Blue Wins).
## Area under the curve: 0.8103
The AUC tells us that our ability to distinguish between the classes (here, a blue win versus a red win) based on the specified predictors is about .81.
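If you want the AUC as a standalone value (to store or compare later), pROC's auc() extracts it from the roc object:
auc(m2_roc)
## Area under the curve: 0.8103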
Keep in mind, I used a simplified model rather than the full model selected by the stepwise regression. That fuller model might well achieve a higher AUC and better predictive ability. But again, for the sake of parsimony and time, I chose to look only at the variables that seemed to have the greatest predictive power.
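If you're curious, the same machinery works on the full stepwise model; I haven't run it here, but the pattern would be:
step1_roc <- roc(lol$blue_wins, step1$fitted.values) # ROC for the full stepwise model
auc(step1_roc) # result not shown in this post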
I hope you have found this exploration helpful!