Portfolio

Diving into League of Legends Data

Author

Yogit Verma

Data Introduction

For this portfolio, I have used data from Professional League of Legends games held in 2024. League of Legends is one of the most popular esports video games. I chose this data because I have worked on it before and wanted to continue my work on it.

Game Introduction

This game has 5 players on each team. Before the start of the game, the players get to choose a character that they would like to play. Each person is also assigned a specific role in the game. There are three lanes on the map of League of Legends. The lanes are called Top, Middle, and Bottom because of their position. The objective to the game is to destroy the structures of the enemy team in all the lanes and then destroy the final structure, which is called the “Nexus”. Players have to earn gold and buy items in order to get stronger and accomplish their goal for the game. Players can earn gold by killing minions, monsters that are unplayable characters, or by killing the characters controlled by enemy players. More gold means that the chances of winning are higher, but the players still have to strategize and plan properly to win the game.

Game Description

Course Objectives

When I started on my journey with this course. I was sure that there would be an opportunity for me to learn more about new rules and formulas, which I did. But what I did not expect was that I would also learn how to observe, think, and analyze like a statistician. There are so many things in statistics that go against the normal instincts of an average human being. It is because we have developed those instincts based on the biases that we have created from our experiences.

Principal Component Regression

Many players enjoy playing League of Legends, and most want to improve their skills and contribute to their team’s success. It can be challenging for players to identify the key aspects of the game they need to focus on, as the game is quite complex. This part of the analysis and modeling focuses on Individual data from players. As mentioned on the home page, earning gold in this game is the best way to increase the chances of winning the game; therefore, I am just going to create a model to find out what variables affect the amount of gold earned in a game.

source("globals.R")

For PCR, I selected the variables that made sense to me, based on my knowledge, and the variables that did not have a lot of 0 values or missing values. Running the regsubsets function here is also really good for us because it allows me to see the order in which I should add variables in my model.

pcr_players_data <- players_data|>
  select(damagetochampions, totalgold, kills, deaths, assists, wardskilled, wardsplaced, visionscore, position, total.cs)
library(leaps)
regfit.full <- regsubsets(totalgold ~ ., pcr_players_data, nvmax = 9)
summary(regfit.full)
Subset selection object
Call: regsubsets.formula(totalgold ~ ., pcr_players_data, nvmax = 9)
9 Variables  (and intercept)
                  Forced in Forced out
damagetochampions     FALSE      FALSE
kills                 FALSE      FALSE
deaths                FALSE      FALSE
assists               FALSE      FALSE
wardskilled           FALSE      FALSE
wardsplaced           FALSE      FALSE
visionscore           FALSE      FALSE
positionNon-Laner     FALSE      FALSE
total.cs              FALSE      FALSE
1 subsets of each size up to 9
Selection Algorithm: exhaustive
         damagetochampions kills deaths assists wardskilled wardsplaced
1  ( 1 ) " "               " "   " "    " "     " "         " "        
2  ( 1 ) " "               " "   " "    " "     " "         " "        
3  ( 1 ) " "               "*"   " "    " "     " "         " "        
4  ( 1 ) " "               "*"   " "    "*"     " "         " "        
5  ( 1 ) "*"               "*"   " "    "*"     " "         " "        
6  ( 1 ) "*"               "*"   " "    "*"     "*"         " "        
7  ( 1 ) "*"               "*"   " "    "*"     "*"         "*"        
8  ( 1 ) "*"               "*"   " "    "*"     "*"         "*"        
9  ( 1 ) "*"               "*"   "*"    "*"     "*"         "*"        
         visionscore positionNon-Laner total.cs
1  ( 1 ) " "         " "               "*"     
2  ( 1 ) "*"         " "               "*"     
3  ( 1 ) "*"         " "               "*"     
4  ( 1 ) "*"         " "               "*"     
5  ( 1 ) "*"         " "               "*"     
6  ( 1 ) "*"         " "               "*"     
7  ( 1 ) "*"         " "               "*"     
8  ( 1 ) "*"         "*"               "*"     
9  ( 1 ) "*"         "*"               "*"     
reg.sum <- summary(regfit.full)
max_adjr2 <-which.max(reg.sum$adjr2)
plot(reg.sum$adjr2, xlab = "Number of Variables",
    ylab = "Adjusted RSq", type = "l")
points(max_adjr2, reg.sum$adjr2[max_adjr2], col = "red", cex = 2, 
    pch = 20)

x <- model.matrix(totalgold ~ ., pcr_players_data)[, -1]
y <- pcr_players_data$totalgold
pcr.fit <- pcr(totalgold ~ ., data = pcr_players_data, scale = TRUE,
    validation = "CV")
summary(pcr.fit)
Data:   X dimension: 29190 9 
    Y dimension: 29190 1
Fit method: svdpc
Number of components considered: 9

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV            3406     2670     1495     1428     1428     1405     1375
adjCV         3406     2670     1495     1428     1428     1405     1375
       7 comps  8 comps  9 comps
CV        1370    854.7    831.3
adjCV     1370    854.6    831.2

TRAINING: % variance explained
           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
X            43.28    63.16    74.52    83.64    89.60    95.36    98.19
totalgold    38.58    80.73    82.44    82.44    82.99    83.70    83.82
           8 comps  9 comps
X            99.74   100.00
totalgold    93.71    94.05
validationplot(pcr.fit, val.type = "MSEP")

set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(pcr_players_data),
    replace = TRUE)
test <- (!train)
y.test <- y[test]
pcr.fit <- pcr(totalgold ~ ., data = pcr_players_data, subset = train,
    scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")

Based on the graph here, we can see that the best number of components to select would be 2 or 8. That conclusion can be made because after 2 the drop in MSEP is quite low, and the same can be seen after 8, where there is a steep drop after 7, but that drop gets very negligible after 8.

pcr.pred <- predict(pcr.fit, x[test, ], ncomp = 8)
mean((pcr.pred - y.test)^2)
[1] 725780

Linear Regression

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

lm_spec
Linear Regression Model Specification (regression)

Computational engine: lm 

For the first model, I tried the 8 variables that were recommended earlier, where the position of the variable is the categorical variable. I have reduced the position variable into two categories of Laner ( Middle Lane, Top Lane, Bottom Lane)and Non-Laner (Jungle and Support).

gold_mod1 <- lm_spec |>
  fit( totalgold ~ (total.cs + visionscore +  kills + assists + damagetochampions + wardskilled +  wardsplaced)*position ,data = players_data)

glance(gold_mod1)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df   logLik     AIC     BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>    <dbl>   <dbl>   <dbl>
1     0.942         0.942  819.    31742.       0    15 -237206. 474447. 474588.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(gold_mod1)
# A tibble: 16 × 5
   term                                  estimate std.error statistic   p.value
   <chr>                                    <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)                         1115.       30.0       37.1    1.78e-294
 2 total.cs                              30.7       0.115    267.     0        
 3 visionscore                           42.0       0.937     44.9    0        
 4 kills                                345.        2.73     126.     0        
 5 assists                              118.        1.93      61.3    0        
 6 damagetochampions                      0.0250    0.00100   25.0    4.65e-136
 7 wardskilled                          -41.6       1.79     -23.2    5.95e-118
 8 wardsplaced                            3.76      1.67       2.25   2.42e-  2
 9 positionNon-Laner                   1368.       43.9       31.2    6.65e-210
10 total.cs:positionNon-Laner            -2.99      0.213    -14.0    1.11e- 44
11 visionscore:positionNon-Laner          0.0371    1.31       0.0283 9.77e-  1
12 kills:positionNon-Laner              -29.3       5.69      -5.15   2.62e-  7
13 assists:positionNon-Laner             -4.54      2.55      -1.78   7.48e-  2
14 damagetochampions:positionNon-Laner   -0.00382   0.00202   -1.90   5.79e-  2
15 wardskilled:positionNon-Laner         -0.967     2.68      -0.360  7.19e-  1
16 wardsplaced:positionNon-Laner        -22.6       2.06     -11.0    4.54e- 28

The categorical variable here is interactions with all the numeric variables, but after looking at the model, I decided that I can go ahead and remove most of the interactions and only keep the extremely significant ones. The model is already explaining more than 94% variability here, and reducing the size of the model will be really helpful here.

We want to test out the interaction term because there are times when a categorical and a numeric variable do not make sense individually, but they explain a lot of variability and become significant when they come together. Here, I even removed a couple of significant interaction terms to reduce the size of the model.

gold_mod3 <- lm_spec |>
  fit( totalgold ~ total.cs*position + visionscore +  kills + assists + damagetochampions + wardskilled +  wardsplaced*position ,data = players_data)

glance(gold_mod3)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df   logLik     AIC     BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>    <dbl>   <dbl>   <dbl>
1     0.942         0.942  819.    47520.       0    10 -237236. 474495. 474595.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(gold_mod3)
# A tibble: 11 × 5
   term                           estimate std.error statistic   p.value
   <chr>                             <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)                   1124.     29.2          38.4  2.19e-315
 2 total.cs                        30.8     0.107       287.   0        
 3 positionNon-Laner             1352.     41.4          32.7  2.78e-230
 4 visionscore                     41.8     0.652        64.1  0        
 5 kills                          340.      2.38        143.   0        
 6 assists                        115.      1.26         91.8  0        
 7 damagetochampions                0.0246  0.000864     28.5  6.32e-176
 8 wardskilled                    -41.3     1.33        -31.1  9.29e-209
 9 wardsplaced                      4.77    1.29          3.70 2.20e-  4
10 total.cs:positionNon-Laner      -3.62    0.160       -22.6  2.75e-112
11 positionNon-Laner:wardsplaced  -23.8     0.940       -25.3  4.69e-140

In the iteration, less than 0.001 adjusted R² was lost, but I decided to keep this model for now. There were a lot of other linear models that were used here, but I am only keeping the most important one.

gold_mod <- lm_spec |>
  fit( totalgold ~ total.cs + visionscore +  kills + assists + damagetochampions + wardskilled +  wardsplaced + position ,data = players_data)


glance(gold_mod)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df   logLik     AIC     BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>    <dbl>   <dbl>   <dbl>
1     0.940         0.940  832.    57445.       0     8 -237696. 475413. 475496.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(gold_mod) %>% 
  knitr::kable()
term estimate std.error statistic p.value
(Intercept) 1611.9434452 24.7225603 65.20132 0
total.cs 29.6681053 0.0972094 305.19783 0
visionscore 43.2877494 0.6584523 65.74167 0
kills 336.3066375 2.4173056 139.12458 0
assists 114.7573630 1.2769241 89.87015 0
damagetochampions 0.0281584 0.0008671 32.47455 0
wardskilled -42.3655734 1.3459800 -31.47563 0
wardsplaced -14.1298130 0.8937534 -15.80952 0
positionNon-Laner 250.8048397 16.2975510 15.38911 0

The model without any interaction was also showing good variability.

After being satisfied with the linear models. I moved on to working with the polynomial model.

I started by adding variables in the same order as I had seen them in regsubsets. I tried out different exponent values for total.cs and using its interaction, I tried adding visionscore and see what exponent values would work best that variable as well.

Polynomial Regression

poly_gold1 <- lm(totalgold ~ poly(total.cs, 2, raw = T)*position + visionscore, data = player_train)

glance(poly_gold1)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df   logLik     AIC     BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>    <dbl>   <dbl>   <dbl>
1     0.835         0.835 1382.    17194.       0     6 -176749. 353514. 353577.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(poly_gold1)
# A tibble: 7 × 5
  term                                    estimate std.error statistic   p.value
  <chr>                                      <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)                              4.59e+3 119.          38.6  2.45e-314
2 poly(total.cs, 2, raw = T)1              1.71e+1   0.809       21.1  9.44e- 98
3 poly(total.cs, 2, raw = T)2              3.26e-2   0.00143     22.8  2.22e-113
4 positionNon-Laner                       -1.39e+3 119.         -11.6  3.98e- 31
5 visionscore                              3.29e+1   0.414       79.5  0        
6 poly(total.cs, 2, raw = T)1:positionNo…  9.24e+0   1.06         8.69 3.85e- 18
7 poly(total.cs, 2, raw = T)2:positionNo… -6.79e-3   0.00293     -2.32 2.05e-  2

With just 3 variables, we had a model that was explaining 83% variability.

poly_gold2 <- lm(totalgold ~(total.cs^2)*position + visionscore+ kills + assists , data = player_train)

glance(poly_gold2)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df   logLik     AIC     BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>    <dbl>   <dbl>   <dbl>
1     0.936         0.936  859.    49980.       0     6 -167020. 334056. 334119.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(poly_gold2, conf.int = TRUE)
# A tibble: 7 × 7
  term                  estimate std.error statistic  p.value conf.low conf.high
  <chr>                    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)            1484.      33.0        44.9 0         1419.     1549.  
2 total.cs                 31.5      0.115     273.  0           31.3      31.7 
3 positionNon-Laner       740.      42.3        17.5 4.39e-68   658.      823.  
4 visionscore              28.0      0.242     116.  0           27.5      28.4 
5 kills                   371.       2.64      140.  0          365.      376.  
6 assists                 130.       1.51       86.1 0          127.      133.  
7 total.cs:positionNon…    -2.87     0.180     -15.9 1.04e-56    -3.22     -2.51

For the last model. I added kills and assists to the model. I also tried to add their exponents and log value, but the changes were not substantial enough; therefore, I chose to keep them in their original form in the model.

This is the final model I chose to work with.

Model assessment

We can also see the confidence interval above. Since Stats deals with uncertainty we cannot definitely say what the actual coefficients for this model should be but the confidence intervals provide us the rage in which these coeffiecients will fall under. For example, At 95% confidence interval we can conclude that the coefficient for visonscore will be between 27.478 and 28.42.

poly_gold_aug <- augment(poly_gold2, newdata = player_test)



poly_gold_aug|>
  gf_point(.resid~.fitted) |>
  gf_hline(yintercept = 0, lty="dashed",col='red') |>
  gf_labs(y='Residuals',x='Fitted values')

poly_gold_aug|>
    gf_qq(~.resid) |>
  gf_qqline()

Residual Analysis here shows a lot of instances where the normality is not being followed by the residuals. This means that the model is very problematic and we need to work with a variables to make it residuals follow normal distribution. This model cannot be ignored because it still explains a lot of variability in the data but it still needs more fine tuning to be done with it.

boot.fn <- function(data, index)
  coef(
      lm(totalgold ~(total.cs^2)*position + visionscore+ kills + assists , 
        data = data, subset = index)
    )
set.seed(90)
boot(player_train, boot.fn, 1000)

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = player_train, statistic = boot.fn, R = 1000)


Bootstrap Statistics :
       original       bias    std. error
t1* 1483.923685 -1.557350096  42.2453800
t2*   31.501545  0.004007909   0.1649853
t3*  740.494868  1.861447052  62.6367087
t4*   27.951399 -0.007977034   0.3314261
t5*  370.537891  0.030271310   3.1664675
t6*  130.114273  0.092362844   1.5048566
t7*   -2.866021 -0.005463348   0.2485562
summary(
    lm(totalgold ~(total.cs^2)*position + visionscore+ kills + assists , data = player_train)
  )$coef
                              Estimate Std. Error   t value     Pr(>|t|)
(Intercept)                1483.923685 33.0130640  44.94959 0.000000e+00
total.cs                     31.501545  0.1153967 272.98483 0.000000e+00
positionNon-Laner           740.494868 42.3106717  17.50137 4.392032e-68
visionscore                  27.951399  0.2417388 115.62647 0.000000e+00
kills                       370.537891  2.6447327 140.10410 0.000000e+00
assists                     130.114273  1.5105960  86.13440 0.000000e+00
total.cs:positionNon-Laner   -2.866021  0.1800464 -15.91824 1.038760e-56

After bootstrapping, here we can see a clearer picture of the tour intercept and coefficients. Bootstraping allows us to perform our modeling on a lot of combinations, which gives us better chances of improving our model. More bootstrapping allows us to have the maximum chance of finding the optimal value for our coefficients.

Based on the above information \[ totalgold = 1483.92 + 31.50 \times total.cs + 740.49 \times positonNon-Laner + 27.95 \times visionscore + 370.54 \times kills + 130.11 \times assists - 2.87\times total.cs:positonNon-Laner + \varepsilon \] This means that if everything is constant and total.cs is increased by 1, then the totalgold will increase by 31.50. And that will be the same with all the other variables and their respective coefficients.

Discriminant Analysis

I have talked earlier about how there are different roles and characters for different players. These roles require different tasks to be done by the player and for that reason, we can assume that these players will have different damage numbers and gold earnings. To look into that, I decided to make a discriminant analysis model here to see if we can seperate people based on their roles or position.

library(discrim)

discrim_quad() |>
  set_mode("classification") |>
  set_engine("MASS") |>
  translate()
Quadratic Discriminant Model Specification (classification)

Computational engine: MASS 

Model fit template:
MASS::qda(formula = missing_arg(), data = missing_arg())
role.qda.fit <- discrim_quad() |>
  fit(role ~ kills + deaths + assists + totalgold + visionscore+damagetochampions + total.cs, data = player_train)

role.qda.fit
parsnip model object

Call:
qda(role ~ kills + deaths + assists + totalgold + visionscore + 
    damagetochampions + total.cs, data = data)

Prior probabilities of groups:
Bottom Lane      Jungle Middle Lane     Support    Top Lane 
  0.1967895   0.1983067   0.2002153   0.2020751   0.2026134 

Group means:
                kills   deaths  assists totalgold visionscore damagetochampions
Bottom Lane 4.1178811 2.288237 5.333499 13932.389    46.05098         22061.439
Jungle      2.3889437 2.849704 7.484205 11080.245    51.54911         12461.545
Middle Lane 3.5978978 2.386214 5.615253 13593.687    40.75483         21637.403
Support     0.7846936 3.204650 9.098087  8224.474   111.91378          6282.042
Top Lane    2.6572464 2.806522 4.843237 12462.694    34.22512         16823.743
             total.cs
Bottom Lane 287.72768
Jungle      194.67078
Middle Lane 289.47250
Support      48.16275
Top Lane    256.46304

As we can see that there is a difference in means among different roles here for all the variables. This also shows that there is almost equal probability of for a player to belong to any group here. Based on the difference in means I feel strongly about the assessment of this model.

role.qda.class <- predict(extract_fit_engine(role.qda.fit), player_test)$class


table(role.qda.class,player_test$role)
              
role.qda.class Bottom Lane Jungle Middle Lane Support Top Lane
   Bottom Lane         390    152         208      18       58
   Jungle               37   1435          38      60      120
   Middle Lane         862     69         851       1      305
   Support              58     16           0    1582        0
   Top Lane            470    114         650      48     1215
mean(role.qda.class == player_test$role)
[1] 0.6249857

The accuracy of 62% might seem low without context but when we consider that the chancing of guessing the correct role for a random player will be 1/5 then we can start to appreciate how well this model is doing. The classification models are a perfect way to see how Statistics can turn complete randomness into informed randomness and increases our judgement based on that.

position.qda.fit <- discrim_quad() %>%
  fit(position ~ kills + deaths + assists + totalgold + visionscore + damagetochampions + total.cs, data = player_train)

position.qda.fit
parsnip model object

Call:
qda(position ~ kills + deaths + assists + totalgold + visionscore + 
    damagetochampions + total.cs, data = data)

Prior probabilities of groups:
    Laner Non-Laner 
0.5996183 0.4003817 

Group means:
             kills   deaths  assists totalgold visionscore damagetochampions
Laner     3.450702 2.496082 5.261916  13322.68    40.28657         20150.013
Non-Laner 1.579269 3.028847 8.298741   9638.92    82.01552          9342.713
          total.cs
Laner     277.7458
Non-Laner 120.7273

Based on the earlier Group means I decided to convert the roles into two categories of “Laners” and “Non-Laners”. I imagined that this will be an even more accurate model since we can see the differences in means all across the board here.

pos.qda.class = predict(extract_fit_engine(position.qda.fit), player_test)$class
table(pos.qda.class, player_test$position)
             
pos.qda.class Laner Non-Laner
    Laner      5004       895
    Non-Laner   258      2600
mean(pos.qda.class == player_test$position)
[1] 0.8683339

The same model has even higher accuracy when deciding between the two categories as one would expect. Once again if were to guess this randomly then our odds of guessing the right one will be 50% if we know nothing about the data (including the prior probabilities).

I tried this same model again with lda just to see what I can get here

discrim_linear() |>
  set_mode("classification") |>
  set_engine("MASS") |>
  translate()
Linear Discriminant Model Specification (classification)

Computational engine: MASS 

Model fit template:
MASS::lda(formula = missing_arg(), data = missing_arg())
pos.lda <- discrim_linear()|>
  fit(position ~ kills + deaths + assists + totalgold + visionscore + damagetochampions + total.cs, data = player_train)


pos.lda
parsnip model object

Call:
lda(position ~ kills + deaths + assists + totalgold + visionscore + 
    damagetochampions + total.cs, data = data)

Prior probabilities of groups:
    Laner Non-Laner 
0.5996183 0.4003817 

Group means:
             kills   deaths  assists totalgold visionscore damagetochampions
Laner     3.450702 2.496082 5.261916  13322.68    40.28657         20150.013
Non-Laner 1.579269 3.028847 8.298741   9638.92    82.01552          9342.713
          total.cs
Laner     277.7458
Non-Laner 120.7273

Coefficients of linear discriminants:
                            LD1
kills             -2.987363e-02
deaths             4.899846e-02
assists            5.621378e-02
totalgold          1.299493e-04
visionscore        6.559287e-03
damagetochampions -5.085047e-05
total.cs          -1.105491e-02

This model also shows us the coefficents of Linear Discriminants which shows us the weight that has been allocated to each varaible in this model.

pos.lda.class = predict(extract_fit_engine(pos.lda), player_test)$class
table(pos.lda.class, player_test$position)
             
pos.lda.class Laner Non-Laner
    Laner      5116      1025
    Non-Laner   146      2470
mean(pos.lda.class == player_test$position)
[1] 0.8662784

We can see here that there is not a huge difference between linear and quadrilateral Discriminant Analysis here but because of simplicity if I were to choose a specific Model then I will chose the linear model.

Logistic Regression

When looking from team perspective, we can assume that all teams want to win in this game and for that reason I decided to look at the result variable for the game.

Here I once again picked the variables that seemed useful to me based on my knowledge and then started using p-values and other stats to start removing the variables that were not good for the model.

teams_data |>
  ggplot(aes(x = result, y = totalgold ,fill = result))+
  geom_boxplot(fill = c("navy", "maroon"))+
  scale_y_continuous(expand = c(0,0))

teams_data |>
  ggplot(aes(x = result, y = teamkills ,fill = result))+
  geom_boxplot(fill = c("navy", "maroon"))+
  scale_y_continuous(expand = c(0,0))

We can clearly see the difference between average totalkills in victory games and Defeats game. That seemed like a great way to start the logistical regression here.

log_reg <- logistic_reg()|>
  set_engine("glm")


result_mod <- log_reg|>
  fit(result ~ teamkills + teamdeaths + gamelength + totalgold +dragons + barons+ wardskilled + visionscore +wardskilled +leaguetype, data = team_train, family = "binomial" )

tidy(result_mod) %>% 
  knitr::kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) -1.060 0.817 -1.298 0.194
teamkills 0.268 0.036 7.478 0.000
teamdeaths -0.503 0.033 -15.342 0.000
gamelength -0.020 0.002 -11.707 0.000
totalgold 0.001 0.000 12.284 0.000
dragons 0.219 0.112 1.962 0.050
barons -0.233 0.179 -1.298 0.194
wardskilled 0.015 0.010 1.472 0.141
visionscore -0.008 0.004 -1.908 0.056
leaguetypeInternational 0.696 0.556 1.253 0.210

I had to do trial and error here to remove many variables that were giving me a really high P-value and there were some variables that were not giving me high p-value for themselves but keeping them in the model was giving me high p-value for the intercept and I had to clean all of those variables from the model. After a lot of trials and errors I got to the model below.

result_mod1 <- log_reg|>
  fit(result ~ teamkills + wardskilled + visionscore  , data = team_train, family = "binomial" )

glance(result_mod1)
# A tibble: 1 × 8
  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
          <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1         5664.    4085 -1556. 3119. 3144.    3111.        4082  4086
tidy(result_mod1) %>% 
  knitr::kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) -4.242 0.204 -20.785 0
teamkills 0.372 0.011 32.927 0
wardskilled 0.018 0.004 4.324 0
visionscore -0.006 0.001 -5.290 0

\[ \log\left(\frac{\hat{p}}{1-\hat{p}}\right) = -4.242 + 0.372 \times (teamkills) + 0.018 \times (wardskilled) - 0.006 \times (visionscore) \]

This shows that For each additional each extra teamkill in a game, we expect the log odds the team winning to increase by .372. Assuming that no other variables are changed.

\[ \text{odds} = e^{\log(\text{odds})} \]

# To store residuals and create row number variable
result_aug <- augment(extract_fit_engine(result_mod1),new_data = train_team, type.predict = "response", 
                      type.residuals = "deviance") |>
                      mutate(id = row_number())



# Plot residuals vs fitted values
ggplot(data = result_aug, aes(x = .fitted, y = .resid, color = result)) + 
geom_point() + 
geom_hline(yintercept = 0, color = "red") + 
labs(x = "Fitted values", 
     y = "Deviance residuals", 
     title = "Deviance residuals vs. fitted")

# Plot residuals vs row number
ggplot(data = result_aug, aes(x = id, y = .resid, color = result)) + 
geom_point() + 
geom_hline(yintercept = 0, color = "red") + 
labs(x = "id", 
     y = "Deviance residuals", 
     title = "Deviance residuals vs. id")

The residual analysis shows a nice spread in the residual against the id column which shows that the variability in residuals is there. There are not a lot of problems with extreme values. There is a huge gap between victory the points when we looking at the fitted value graph as it is expected in this kind of dataset.

Prediction Accuracy

result_glm <- glm(result ~ teamkills + wardskilled + visionscore , data = team_train, family = "binomial" )

pred_prob <- predict(result_glm, newdata = team_test, type = "response")

pred_result <- ifelse(pred_prob > 0.5, "Victory", "Defeat")

table(pred_result)
pred_result
 Defeat Victory 
    902     850 
mean(team_train$result == pred_result)
Warning in `==.default`(team_train$result, pred_result): longer object length
is not a multiple of shorter object length
Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
shorter object length
[1] 0.5073421

I decided to check the response here as well to see how accurate the model was and the accuracy around 50% is very concerning here. This model suggests that our model is almost as good as guessing the winner of the game. In a way, this shows that the game is really unpredictable and there is an opportunity for people to make a come back from behind in any scenario.

position_glm <- glm(position ~ kills + deaths + assists + totalgold + visionscore + damagetochampions + total.cs, data = player_train, family = "binomial")

position_glm

Call:  glm(formula = position ~ kills + deaths + assists + totalgold + 
    visionscore + damagetochampions + total.cs, family = "binomial", 
    data = player_train)

Coefficients:
      (Intercept)              kills             deaths            assists  
        1.753e+00          6.768e-02          1.522e-01          1.926e-01  
        totalgold        visionscore  damagetochampions           total.cs  
       -6.482e-07          4.283e-02         -1.671e-04         -1.637e-02  

Degrees of Freedom: 20432 Total (i.e. Null);  20425 Residual
Null Deviance:      27510 
Residual Deviance: 11010    AIC: 11020
pred_prob <- predict(position_glm, newdata = player_test, type = "response")

pred_position <- ifelse(pred_prob > 0.5, "Non-Laner", "Laner")

table(pred_result)
pred_result
 Defeat Victory 
    902     850 
mean(player_test$position == pred_position)
[1] 0.9086445

I decided to try the logistic regression model with the same parameters that I had used for Quadratic Discriminant Analysis to determine wether a player was a laner or not. It is really interesting to see that the logistic regression model performs better than QDA and LDA above.
This logistic regression model shows that the log odd of a person being a “Non-Laner” increase by .06 with each extra kill, if everything else remains similar.

Multinomial Logsitic Regression

After seeing how good the Logistic Regression model was for Identifying wether a person was a “Laner” or “Non-Laner”. I am going to see if I can use a mulitinomial Logistic Regression Model to see if I can identify what position a person plays: “Top Lane”, “Middle Lane”, “Bottom Lane”, “Jungler” or “Support”.

multi_role_mod <- multinom_reg() |>
  set_engine("nnet") |> 
  fit(role ~ kills + deaths + assists + totalgold + visionscore + damagetochampions + total.cs, data = player_train)

tidy(multi_role_mod) |>
  print(n = Inf)
# A tibble: 32 × 6
   y.level     term                  estimate  std.error    statistic   p.value
   <chr>       <chr>                    <dbl>      <dbl>        <dbl>     <dbl>
 1 Jungle      (Intercept)        3.06        0.00000428  714664.     0        
 2 Jungle      kills             -0.0854      0.000139      -614.     0        
 3 Jungle      deaths             0.219       0.0000388     5654.     0        
 4 Jungle      assists            0.210       0.000173      1213.     0        
 5 Jungle      totalgold          0.000263    0.0000224       11.8    6.78e- 32
 6 Jungle      visionscore        0.0109      0.00175          6.24   4.31e- 10
 7 Jungle      damagetochampions -0.000184    0.00000549     -33.4    4.33e-245
 8 Jungle      total.cs          -0.0225      0.000861       -26.1    4.19e-150
 9 Middle Lane (Intercept)       -0.0399      0.00000145  -27467.     0        
10 Middle Lane kills             -0.0104      0.0000817     -127.     0        
11 Middle Lane deaths             0.0477      0.0000184     2589.     0        
12 Middle Lane assists            0.0961      0.0000768     1251.     0        
13 Middle Lane totalgold         -0.000185    0.0000171      -10.8    2.04e- 27
14 Middle Lane visionscore       -0.0240      0.00167        -14.3    1.50e- 46
15 Middle Lane damagetochampions  0.000000158 0.00000335       0.0473 9.62e-  1
16 Middle Lane total.cs           0.0105      0.000666        15.8    1.77e- 56
17 Support     (Intercept)        5.01        0.00000251 1999788.     0        
18 Support     kills             -0.287       0.0000656    -4367.     0        
19 Support     deaths             0.289       0.0000557     5196.     0        
20 Support     assists            0.298       0.000303       980.     0        
21 Support     totalgold          0.000172    0.0000431        3.98   6.82e-  5
22 Support     visionscore        0.0404      0.00273         14.8    1.65e- 49
23 Support     damagetochampions -0.000385    0.0000130      -29.7    3.32e-194
24 Support     total.cs          -0.0416      0.00168        -24.8    2.36e-135
25 Top Lane    (Intercept)        1.27        0.0000100   126657.     0        
26 Top Lane    kills             -0.408       0.000140     -2909.     0        
27 Top Lane    deaths             0.129       0.0000853     1513.     0        
28 Top Lane    assists           -0.0125      0.000132       -94.6    0        
29 Top Lane    totalgold          0.000818    0.0000190       43.2    0        
30 Top Lane    visionscore       -0.0746      0.00204        -36.6    5.13e-293
31 Top Lane    damagetochampions -0.0000663   0.00000409     -16.2    4.83e- 59
32 Top Lane    total.cs          -0.0247      0.000721       -34.3    2.72e-257

In Multinomial Logistic Regression we have multiple equations based on the number of possible outcomes so if we were to take only position into consideration then the solution will look something like the following. I am going to take Jungle as an example. We calculate the odds of a person playing the Jungle role against people playing every other role. \[ \begin{align*} \log\left(\frac{\hat{p}_{\text{jungle}}}{\hat{p}_{\text{others}}}\right) &= 3.06 \\ &- 8.54 \times \text{kills} \\ &+ 2.19 \times \text{deaths} \\ &+ 2.10 \times \text{assists} \\ &+ 2.63 \times \text{totalgold} \\ &+ 1.09 \times \text{visionscore} \\ &- 1.83 \times \text{damagetochampions} \\ &- 2.25 \times \text{total.cs} \end{align*} \]

Based on the data here we can say that the if a player’s kills increased by one then the log odds of that player being a jungle is reduced by 8.54. While everything else is constant.

Model Performance

role_aug <- augment(multi_role_mod, new_data = player_test)

role_aug
# A tibble: 8,757 × 20
   .pred_class `.pred_Bottom Lane` .pred_Jungle `.pred_Middle Lane`
   <fct>                     <dbl>        <dbl>               <dbl>
 1 Top Lane               0.200          0.0550          0.360     
 2 Top Lane               0.188          0.0268          0.263     
 3 Top Lane               0.0995         0.0461          0.236     
 4 Bottom Lane            0.315          0.127           0.294     
 5 Support                0.000228       0.0834          0.00000641
 6 Bottom Lane            0.368          0.0986          0.312     
 7 Jungle                 0.00411        0.644           0.00616   
 8 Top Lane               0.239          0.136           0.288     
 9 Top Lane               0.163          0.0232          0.263     
10 Bottom Lane            0.413          0.115           0.349     
# ℹ 8,747 more rows
# ℹ 16 more variables: .pred_Support <dbl>, `.pred_Top Lane` <dbl>,
#   teamname <chr>, champion <chr>, playername <chr>, kills <int>,
#   deaths <int>, assists <int>, wardskilled <int>, wardsplaced <int>,
#   visionscore <int>, position <fct>, total.cs <int>, totalgold <int>,
#   damagetochampions <int>, role <fct>
table(role_aug$.pred_class, role_aug$role)
             
              Bottom Lane Jungle Middle Lane Support Top Lane
  Bottom Lane         534    214         485       1      211
  Jungle              147   1424          67      96      205
  Middle Lane         703     26         718       2      254
  Support              41     11           0    1581        0
  Top Lane            392    111         477      29     1028
role_aug|> 
  ggplot(aes(x = role, fill = .pred_class))+
  geom_bar()

The Graph here shows tht Support and jungle were the easiest to tell apart while other laning positions were more likely to be confused with one another. This also explains why our earlier model performed so well. This model shows show laning positions are the ones that are most difficult to tell apart and that is where most of our inaccurate predictions were coming from. Looking at Bottom Lane we can see that Bottom Lane is more likely to be predicted as Middle Lane than it is Bottom Lane and Middle Lane was getting predicted as other lanes quite often. This shows how all the laning positions are quite similar and Top Lane position might be the one that is the most different out of the Laning roles while Middle Lane and Bottom Lane are extremely close to one another in playstyle.

mean(role_aug$.pred_class == role_aug$role)
[1] 0.6035172

This model performs slightly worse than Discriminant Analysis model. The accuracy here is 60.35% which is really high considering the number of classes we are trying to guess. it is interesting to see how for e category outcome the logistic regression worked bettern the QDA model but for 5 categories the QDA model was better. The probability of guessing the a players role correctly with no information given is around 20%. After applying statistical modeling and analyzing all the data we increase that probability by 40%. Before statistics we only have a sample size of outcomes and every outcome seems possible and equally likely, with statistics we are able to calculate better odds, better probabilities and get closer to making a more accurate judgement.

Probability as a Foundation of Statistics (Objective 1)

I have learned from this course how Statistics heavily depends on probability. The biggest example of that is the 95% confidence interval. After every research, when hypothesis testing is performed, we are usually only 95% sure of the outcome. As statisticians, we deal with uncertainty. If we could find the true average of a population, then there would be no need for statistics. We know that it is not possible to do that, and by using our different modeling, we can only make the model more and more certain, but we will always have some amount of uncertainty. We do this by creating multiple models, testing them constantly, increasing the sample size, etc. I have implemented bootstrapping in my model to help create a better estimate. Statistics and statistical modeling allows for us to be more informed about the randomness and unpredictability so that we can be more accurate with our guesses. You will notice that I have discussed this from time to time during this project.

Application of linear models (Objective 2)

At the beginning of this course my idea of statistical modeling was limited to simple linear regression and the idea of multiple linear regression seemed quite jarring. In my project you can see the application of Linear and Polynomial models. You can also see me including categorical values in these models. In this class the focus was also to not just look at the Adjusted R squared value but to do thorough testing and looking at the data to come up with the best solution. I have learned to not just look at numbers on the output but also to understand the output and trying to make choices based on those. The numbers provide us with a great starting point but after that we have perform the analysis on our own.

Model Selection (Objective 3)

Throughout this project you will see my model with at least one other model that I used for comparison. It is important to have multiple models to choose from. You will see during my work that I focus on simplicity and try to get a model that can explain the most variability without being overly complex. A simpler model reduces the chance of over fitting and therefore it can be more universally applicable. I have learned to use tools like PCR which are really helpful for giving me a starting points but I try to not rely on them overly. You will see here how I used PCR to get an idea for the models that I want but then switched to other models to create models based on my preference.

Presenting the Models (Objective 4)

Showing the models to general public is the crucial role of a statistician and it is skill that i have been improving on over the time. As someone that is studying stats, it is very easy to forget that people might not understand the terms that are being discussed here and therefore I have tried to explain my final models wherever I can and I believe there is still a lot more work that I can do on that because it will be a life long learning experience.

Fitting and Assessing models (Objective 5)

Over this portfolio you will see my application of fitting the model and using it to predict the data. I have used training and testing data throughout this portfolio and tried to assess all the models that I have worked on here. Model assessment is important not just finding out if the model is good or not but it also helps us manage our expectations with the models. This also allows us to see how the variables react with each other and find out if there is a better approach that we can go for. I am sure you will find my application of model assessment satisfactory here.