source("globals.R")Portfolio
Diving into League of Legends Data
Data Introduction
For this portfolio, I used data from professional League of Legends games played in 2024. League of Legends is one of the most popular esports titles. I chose this data because I have worked with it before and wanted to continue that work.
Game Introduction
This game has 5 players on each team. Before the start of the game, each player chooses a character to play, and each player is also assigned a specific role. There are three lanes on the map of League of Legends, called Top, Middle, and Bottom because of their positions. The objective of the game is to destroy the enemy team's structures in all the lanes and then destroy the final structure, which is called the "Nexus". Players have to earn gold and buy items in order to get stronger and accomplish their goal. Players can earn gold by killing minions and monsters (non-playable characters) or by killing the characters controlled by enemy players. More gold means the chances of winning are higher, but players still have to strategize and plan properly to win the game.
Course Objectives
When I started my journey with this course, I was sure there would be an opportunity to learn new rules and formulas, which I did. What I did not expect was that I would also learn how to observe, think, and analyze like a statistician. Many ideas in statistics go against the instincts of an average person, because those instincts are built on biases formed from our own experiences.
Principal Component Regression
Many players enjoy League of Legends, and most want to improve their skills and contribute to their team's success. It can be challenging for players to identify the key aspects of the game to focus on, since the game is quite complex. This part of the analysis and modeling focuses on individual player data. As mentioned on the home page, earning gold is the best way to increase the chances of winning, so I am going to build a model to find out which variables affect the amount of gold earned in a game.
For PCR, I selected the variables that made sense to me based on my knowledge of the game, and that did not have many zero or missing values. Running the regsubsets function is also helpful here because it shows the order in which I should add variables to my model.
pcr_players_data <- players_data |>
  select(damagetochampions, totalgold, kills, deaths, assists, wardskilled, wardsplaced, visionscore, position, total.cs)

library(leaps)
regfit.full <- regsubsets(totalgold ~ ., pcr_players_data, nvmax = 9)
summary(regfit.full)

Subset selection object
Call: regsubsets.formula(totalgold ~ ., pcr_players_data, nvmax = 9)
9 Variables (and intercept)
Forced in Forced out
damagetochampions FALSE FALSE
kills FALSE FALSE
deaths FALSE FALSE
assists FALSE FALSE
wardskilled FALSE FALSE
wardsplaced FALSE FALSE
visionscore FALSE FALSE
positionNon-Laner FALSE FALSE
total.cs FALSE FALSE
1 subsets of each size up to 9
Selection Algorithm: exhaustive
damagetochampions kills deaths assists wardskilled wardsplaced
1 ( 1 ) " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " "
3 ( 1 ) " " "*" " " " " " " " "
4 ( 1 ) " " "*" " " "*" " " " "
5 ( 1 ) "*" "*" " " "*" " " " "
6 ( 1 ) "*" "*" " " "*" "*" " "
7 ( 1 ) "*" "*" " " "*" "*" "*"
8 ( 1 ) "*" "*" " " "*" "*" "*"
9 ( 1 ) "*" "*" "*" "*" "*" "*"
visionscore positionNon-Laner total.cs
1 ( 1 ) " " " " "*"
2 ( 1 ) "*" " " "*"
3 ( 1 ) "*" " " "*"
4 ( 1 ) "*" " " "*"
5 ( 1 ) "*" " " "*"
6 ( 1 ) "*" " " "*"
7 ( 1 ) "*" " " "*"
8 ( 1 ) "*" "*" "*"
9 ( 1 ) "*" "*" "*"
reg.sum <- summary(regfit.full)
max_adjr2 <- which.max(reg.sum$adjr2)

# Plot adjusted R-squared against model size and mark the maximum
plot(reg.sum$adjr2, xlab = "Number of Variables",
     ylab = "Adjusted RSq", type = "l")
points(max_adjr2, reg.sum$adjr2[max_adjr2], col = "red", cex = 2,
       pch = 20)

# Design matrix (without the intercept column) and response vector
x <- model.matrix(totalgold ~ ., pcr_players_data)[, -1]
y <- pcr_players_data$totalgold

library(pls)  # provides pcr()
pcr.fit <- pcr(totalgold ~ ., data = pcr_players_data, scale = TRUE,
               validation = "CV")
summary(pcr.fit)

Data: X dimension: 29190 9
Y dimension: 29190 1
Fit method: svdpc
Number of components considered: 9
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
CV 3406 2670 1495 1428 1428 1405 1375
adjCV 3406 2670 1495 1428 1428 1405 1375
7 comps 8 comps 9 comps
CV 1370 854.7 831.3
adjCV 1370 854.6 831.2
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 43.28 63.16 74.52 83.64 89.60 95.36 98.19
totalgold 38.58 80.73 82.44 82.44 82.99 83.70 83.82
8 comps 9 comps
X 99.74 100.00
totalgold 93.71 94.05
validationplot(pcr.fit, val.type = "MSEP")

set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(pcr_players_data),
                replace = TRUE)
test <- (!train)
y.test <- y[test]

pcr.fit <- pcr(totalgold ~ ., data = pcr_players_data, subset = train,
               scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")

Based on the graph here, the best number of components to select would be 2 or 8. After 2 components the drop in MSEP is quite small; there is another steep drop from 7 to 8 components, and the improvement becomes negligible after 8.
pcr.pred <- predict(pcr.fit, x[test, ], ncomp = 8)
mean((pcr.pred - y.test)^2)

[1] 725780
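To put this test error on the same scale as the gold totals, we can take the square root of the MSE. This is a minimal sketch reusing pcr.pred and y.test from the chunk above; the square root of 725780 comes out to roughly 852 gold, which is in the same ballpark as the residual standard errors (sigma) of the linear models below.

# Test-set RMSE on the original gold scale
sqrt(mean((pcr.pred - y.test)^2))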
Linear Regression
lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")
lm_spec

Linear Regression Model Specification (regression)

Computational engine: lm
For the first model, I tried the 8 variables that were recommended earlier, where position is the categorical variable. I have reduced the position variable to two categories: Laner (Middle Lane, Top Lane, Bottom Lane) and Non-Laner (Jungle and Support).
gold_mod1 <- lm_spec |>
  fit(totalgold ~ (total.cs + visionscore + kills + assists + damagetochampions + wardskilled + wardsplaced) * position, data = players_data)
glance(gold_mod1)

# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.942 0.942 819. 31742. 0 15 -237206. 474447. 474588.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(gold_mod1)

# A tibble: 16 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1115. 30.0 37.1 1.78e-294
2 total.cs 30.7 0.115 267. 0
3 visionscore 42.0 0.937 44.9 0
4 kills 345. 2.73 126. 0
5 assists 118. 1.93 61.3 0
6 damagetochampions 0.0250 0.00100 25.0 4.65e-136
7 wardskilled -41.6 1.79 -23.2 5.95e-118
8 wardsplaced 3.76 1.67 2.25 2.42e- 2
9 positionNon-Laner 1368. 43.9 31.2 6.65e-210
10 total.cs:positionNon-Laner -2.99 0.213 -14.0 1.11e- 44
11 visionscore:positionNon-Laner 0.0371 1.31 0.0283 9.77e- 1
12 kills:positionNon-Laner -29.3 5.69 -5.15 2.62e- 7
13 assists:positionNon-Laner -4.54 2.55 -1.78 7.48e- 2
14 damagetochampions:positionNon-Laner -0.00382 0.00202 -1.90 5.79e- 2
15 wardskilled:positionNon-Laner -0.967 2.68 -0.360 7.19e- 1
16 wardsplaced:positionNon-Laner -22.6 2.06 -11.0 4.54e- 28
The categorical variable here interacts with all the numeric variables, but after looking at the model, I decided to remove most of the interactions and keep only the extremely significant ones. The model already explains more than 94% of the variability, and reducing its size will be really helpful here.
We want to test the interaction terms because there are times when a categorical and a numeric variable are not significant individually, but they explain a lot of variability and become significant when combined. Here, I even removed a couple of significant interaction terms to reduce the size of the model.
gold_mod3 <- lm_spec |>
  fit(totalgold ~ total.cs * position + visionscore + kills + assists + damagetochampions + wardskilled + wardsplaced * position, data = players_data)
glance(gold_mod3)

# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.942 0.942 819. 47520. 0 10 -237236. 474495. 474595.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(gold_mod3)

# A tibble: 11 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1124. 29.2 38.4 2.19e-315
2 total.cs 30.8 0.107 287. 0
3 positionNon-Laner 1352. 41.4 32.7 2.78e-230
4 visionscore 41.8 0.652 64.1 0
5 kills 340. 2.38 143. 0
6 assists 115. 1.26 91.8 0
7 damagetochampions 0.0246 0.000864 28.5 6.32e-176
8 wardskilled -41.3 1.33 -31.1 9.29e-209
9 wardsplaced 4.77 1.29 3.70 2.20e- 4
10 total.cs:positionNon-Laner -3.62 0.160 -22.6 2.75e-112
11 positionNon-Laner:wardsplaced -23.8 0.940 -25.3 4.69e-140
In this iteration, less than 0.001 of adjusted R² was lost, so I decided to keep this model for now. Several other linear models were tried here, but I am only keeping the most important one. One way to check whether the dropped interactions mattered is a nested-model F test, sketched below.
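This sketch was not part of the original analysis; it assumes gold_mod1 and gold_mod3 are still in memory and relies on the models being nested (every term in gold_mod3 also appears in gold_mod1).

# Nested-model F test: does dropping the interactions significantly hurt fit?
# extract_fit_engine() pulls the underlying lm objects out of the parsnip fits
anova(extract_fit_engine(gold_mod3), extract_fit_engine(gold_mod1))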
gold_mod <- lm_spec |>
  fit(totalgold ~ total.cs + visionscore + kills + assists + damagetochampions + wardskilled + wardsplaced + position, data = players_data)
glance(gold_mod)

# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.940 0.940 832. 57445. 0 8 -237696. 475413. 475496.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(gold_mod) %>%
  knitr::kable()

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 1611.9434452 | 24.7225603 | 65.20132 | 0 |
| total.cs | 29.6681053 | 0.0972094 | 305.19783 | 0 |
| visionscore | 43.2877494 | 0.6584523 | 65.74167 | 0 |
| kills | 336.3066375 | 2.4173056 | 139.12458 | 0 |
| assists | 114.7573630 | 1.2769241 | 89.87015 | 0 |
| damagetochampions | 0.0281584 | 0.0008671 | 32.47455 | 0 |
| wardskilled | -42.3655734 | 1.3459800 | -31.47563 | 0 |
| wardsplaced | -14.1298130 | 0.8937534 | -15.80952 | 0 |
| positionNon-Laner | 250.8048397 | 16.2975510 | 15.38911 | 0 |
The model without any interactions also explained the variability well.
After being satisfied with the linear models, I moved on to working with polynomial models. I started by adding variables in the same order that regsubsets suggested. I tried different exponent values for total.cs along with its interaction with position, then added visionscore and checked which exponent values would work best for that variable as well.
Polynomial Regression
poly_gold1 <- lm(totalgold ~ poly(total.cs, 2, raw = T) * position + visionscore, data = player_train)
glance(poly_gold1)

# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.835 0.835 1382. 17194. 0 6 -176749. 353514. 353577.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(poly_gold1)

# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.59e+3 119. 38.6 2.45e-314
2 poly(total.cs, 2, raw = T)1 1.71e+1 0.809 21.1 9.44e- 98
3 poly(total.cs, 2, raw = T)2 3.26e-2 0.00143 22.8 2.22e-113
4 positionNon-Laner -1.39e+3 119. -11.6 3.98e- 31
5 visionscore 3.29e+1 0.414 79.5 0
6 poly(total.cs, 2, raw = T)1:positionNo… 9.24e+0 1.06 8.69 3.85e- 18
7 poly(total.cs, 2, raw = T)2:positionNo… -6.79e-3 0.00293 -2.32 2.05e- 2
With just 3 variables, we had a model that explained 83% of the variability.
poly_gold2 <- lm(totalgold ~ (total.cs^2) * position + visionscore + kills + assists, data = player_train)
glance(poly_gold2)

# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.936 0.936 859. 49980. 0 6 -167020. 334056. 334119.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
tidy(poly_gold2, conf.int = TRUE)

# A tibble: 7 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1484. 33.0 44.9 0 1419. 1549.
2 total.cs 31.5 0.115 273. 0 31.3 31.7
3 positionNon-Laner 740. 42.3 17.5 4.39e-68 658. 823.
4 visionscore 28.0 0.242 116. 0 27.5 28.4
5 kills 371. 2.64 140. 0 365. 376.
6 assists 130. 1.51 86.1 0 127. 133.
7 total.cs:positionNon… -2.87 0.180 -15.9 1.04e-56 -3.22 -2.51
For the last model, I added kills and assists. I also tried their exponents and log values, but the changes were not substantial enough, so I kept them in their original form. Note that inside an R formula, total.cs^2 without I() is interpreted as just total.cs, which is why the output above shows a linear total.cs term rather than a squared one.
This is the final model I chose to work with.
Model assessment
We can also see the confidence intervals above. Since statistics deals with uncertainty, we cannot say definitively what the actual coefficients for this model should be, but the confidence intervals give us the range in which these coefficients will fall. For example, at the 95% confidence level we can conclude that the coefficient for visionscore is between 27.478 and 28.42.
poly_gold_aug <- augment(poly_gold2, newdata = player_test)

poly_gold_aug |>
  gf_point(.resid ~ .fitted) |>
  gf_hline(yintercept = 0, lty = "dashed", col = 'red') |>
  gf_labs(y = 'Residuals', x = 'Fitted values')

poly_gold_aug |>
  gf_qq(~.resid) |>
  gf_qqline()

The residual analysis shows many instances where the residuals do not follow normality. This means the model has real problems, and we need to work with the variables to bring the residuals closer to a normal distribution. The model cannot be dismissed, because it still explains a lot of the variability in the data, but it needs more fine-tuning. One direction that fine-tuning could take is sketched below.
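This is a hedged sketch, not part of the original analysis; it assumes totalgold is strictly positive so the log is defined, and simply re-examines the residuals of a log-response version of the same model.

# One fine-tuning idea: fit the same model to log(totalgold) and
# check whether the residuals look closer to normal
poly_gold_log <- lm(log(totalgold) ~ (total.cs^2) * position + visionscore + kills + assists, data = player_train)
gf_qq(~resid(poly_gold_log)) |>
  gf_qqline()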
library(boot)  # provides boot()

boot.fn <- function(data, index)
  coef(
    lm(totalgold ~ (total.cs^2) * position + visionscore + kills + assists,
       data = data, subset = index)
  )

set.seed(90)
boot(player_train, boot.fn, 1000)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = player_train, statistic = boot.fn, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 1483.923685 -1.557350096 42.2453800
t2* 31.501545 0.004007909 0.1649853
t3* 740.494868 1.861447052 62.6367087
t4* 27.951399 -0.007977034 0.3314261
t5* 370.537891 0.030271310 3.1664675
t6* 130.114273 0.092362844 1.5048566
t7* -2.866021 -0.005463348 0.2485562
summary(
  lm(totalgold ~ (total.cs^2) * position + visionscore + kills + assists, data = player_train)
)$coef

Estimate Std. Error t value Pr(>|t|)
(Intercept) 1483.923685 33.0130640 44.94959 0.000000e+00
total.cs 31.501545 0.1153967 272.98483 0.000000e+00
positionNon-Laner 740.494868 42.3106717 17.50137 4.392032e-68
visionscore 27.951399 0.2417388 115.62647 0.000000e+00
kills 370.537891 2.6447327 140.10410 0.000000e+00
assists 130.114273 1.5105960 86.13440 0.000000e+00
total.cs:positionNon-Laner -2.866021 0.1800464 -15.91824 1.038760e-56
After bootstrapping, we can see a clearer picture of our intercept and coefficients. Bootstrapping lets us refit the model on many resampled versions of the data, which gives us a better sense of how stable the coefficient estimates are, and more bootstrap replicates give us the best chance of finding reliable estimates. The resamples can also be turned into interval estimates, as sketched below.
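This sketch was not part of the original run; boot.out is a new name, and index = 2 selects the total.cs coefficient (the t2* statistic in the output above).

# Percentile bootstrap confidence interval for the total.cs coefficient
set.seed(90)
boot.out <- boot(player_train, boot.fn, 1000)
boot.ci(boot.out, type = "perc", index = 2)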
Based on the above information \[ totalgold = 1483.92 + 31.50 \times total.cs + 740.49 \times \text{positionNon-Laner} + 27.95 \times visionscore + 370.54 \times kills + 130.11 \times assists - 2.87 \times total.cs{:}\text{positionNon-Laner} + \varepsilon \] This means that if everything else is held constant and total.cs increases by 1, then the predicted totalgold increases by 31.50. The same interpretation applies to all the other variables and their respective coefficients.
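To see the equation in action, here is a hedged sketch that predicts the gold total for a hypothetical Laner stat line (the numbers are made up for illustration):

# Predicted totalgold for a made-up Laner: 250 cs, 40 vision score,
# 3 kills, 5 assists
new_player <- data.frame(total.cs = 250, position = "Laner",
                         visionscore = 40, kills = 3, assists = 5)
predict(poly_gold2, newdata = new_player)

Working through the equation by hand, with both Non-Laner terms zeroed out, gives 1483.92 + 31.50(250) + 27.95(40) + 370.54(3) + 130.11(5), which is roughly 12,239 gold.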
Discriminant Analysis
I talked earlier about how there are different roles and characters for different players. These roles require different tasks from the player, so we can assume that these players will have different damage numbers and gold earnings. To look into that, I decided to build a discriminant analysis model to see if we can separate players based on their role or position.
library(discrim)
discrim_quad() |>
  set_mode("classification") |>
  set_engine("MASS") |>
  translate()

Quadratic Discriminant Model Specification (classification)
Computational engine: MASS
Model fit template:
MASS::qda(formula = missing_arg(), data = missing_arg())
role.qda.fit <- discrim_quad() |>
  fit(role ~ kills + deaths + assists + totalgold + visionscore + damagetochampions + total.cs, data = player_train)
role.qda.fit

parsnip model object
Call:
qda(role ~ kills + deaths + assists + totalgold + visionscore +
damagetochampions + total.cs, data = data)
Prior probabilities of groups:
Bottom Lane Jungle Middle Lane Support Top Lane
0.1967895 0.1983067 0.2002153 0.2020751 0.2026134
Group means:
kills deaths assists totalgold visionscore damagetochampions
Bottom Lane 4.1178811 2.288237 5.333499 13932.389 46.05098 22061.439
Jungle 2.3889437 2.849704 7.484205 11080.245 51.54911 12461.545
Middle Lane 3.5978978 2.386214 5.615253 13593.687 40.75483 21637.403
Support 0.7846936 3.204650 9.098087 8224.474 111.91378 6282.042
Top Lane 2.6572464 2.806522 4.843237 12462.694 34.22512 16823.743
total.cs
Bottom Lane 287.72768
Jungle 194.67078
Middle Lane 289.47250
Support 48.16275
Top Lane 256.46304
We can see a difference in means among the different roles for all the variables. The prior probabilities also show that a player is roughly equally likely to belong to any group. Based on these differences in means, I feel good about this model's prospects.
role.qda.class <- predict(extract_fit_engine(role.qda.fit), player_test)$class
table(role.qda.class, player_test$role)
role.qda.class Bottom Lane Jungle Middle Lane Support Top Lane
Bottom Lane 390 152 208 18 58
Jungle 37 1435 38 60 120
Middle Lane 862 69 851 1 305
Support 58 16 0 1582 0
Top Lane 470 114 650 48 1215
mean(role.qda.class == player_test$role)

[1] 0.6249857
The accuracy of 62% might seem low without context, but when we consider that the chance of guessing the correct role for a random player is 1/5, we can start to appreciate how well this model is doing. Classification models are a perfect example of how statistics can turn complete randomness into informed predictions and improve our judgement.
position.qda.fit <- discrim_quad() %>%
  fit(position ~ kills + deaths + assists + totalgold + visionscore + damagetochampions + total.cs, data = player_train)
position.qda.fit

parsnip model object
Call:
qda(position ~ kills + deaths + assists + totalgold + visionscore +
damagetochampions + total.cs, data = data)
Prior probabilities of groups:
Laner Non-Laner
0.5996183 0.4003817
Group means:
kills deaths assists totalgold visionscore damagetochampions
Laner 3.450702 2.496082 5.261916 13322.68 40.28657 20150.013
Non-Laner 1.579269 3.028847 8.298741 9638.92 82.01552 9342.713
total.cs
Laner 277.7458
Non-Laner 120.7273
Based on the earlier group means, I decided to collapse the roles into two categories, "Laners" and "Non-Laners". I expected this to be an even more accurate model, since we can see differences in means across the board here. A sketch of how that conversion might be done is shown below.
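This is a hypothetical sketch; the portfolio's actual data preparation happens elsewhere (e.g. in globals.R), so the exact code may differ.

# Collapse the five roles into Laner / Non-Laner (sketch)
players_data <- players_data |>
  mutate(position = factor(if_else(role %in% c("Jungle", "Support"),
                                   "Non-Laner", "Laner")))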
pos.qda.class <- predict(extract_fit_engine(position.qda.fit), player_test)$class
table(pos.qda.class, player_test$position)
pos.qda.class Laner Non-Laner
Laner 5004 895
Non-Laner 258 2600
mean(pos.qda.class == player_test$position)

[1] 0.8683339
The same model has even higher accuracy when deciding between just two categories, as one would expect. Once again, if we were to guess randomly, knowing nothing about the data (including the prior probabilities), our odds of guessing correctly would be 50%.
I tried the same model again with LDA just to see what I would get.
discrim_linear() |>
  set_mode("classification") |>
  set_engine("MASS") |>
  translate()

Linear Discriminant Model Specification (classification)
Computational engine: MASS
Model fit template:
MASS::lda(formula = missing_arg(), data = missing_arg())
pos.lda <- discrim_linear() |>
  fit(position ~ kills + deaths + assists + totalgold + visionscore + damagetochampions + total.cs, data = player_train)
pos.lda

parsnip model object
Call:
lda(position ~ kills + deaths + assists + totalgold + visionscore +
damagetochampions + total.cs, data = data)
Prior probabilities of groups:
Laner Non-Laner
0.5996183 0.4003817
Group means:
kills deaths assists totalgold visionscore damagetochampions
Laner 3.450702 2.496082 5.261916 13322.68 40.28657 20150.013
Non-Laner 1.579269 3.028847 8.298741 9638.92 82.01552 9342.713
total.cs
Laner 277.7458
Non-Laner 120.7273
Coefficients of linear discriminants:
LD1
kills -2.987363e-02
deaths 4.899846e-02
assists 5.621378e-02
totalgold 1.299493e-04
visionscore 6.559287e-03
damagetochampions -5.085047e-05
total.cs -1.105491e-02
This model also shows us the coefficients of the linear discriminants, which are the weights allocated to each variable in the model.
pos.lda.class <- predict(extract_fit_engine(pos.lda), player_test)$class
table(pos.lda.class, player_test$position)
pos.lda.class Laner Non-Laner
Laner 5116 1025
Non-Laner 146 2470
mean(pos.lda.class == player_test$position)

[1] 0.8662784
We can see that there is not a huge difference between linear and quadratic discriminant analysis here, but for the sake of simplicity, if I had to choose one specific model, I would choose the linear one.
Logistic Regression
Looking from a team perspective, we can assume that every team wants to win, so I decided to look at the result variable for each game.
Here I once again picked the variables that seemed useful based on my knowledge, and then used p-values and other statistics to remove the variables that were not helping the model.
teams_data |>
  ggplot(aes(x = result, y = totalgold, fill = result)) +
  geom_boxplot(fill = c("navy", "maroon")) +
  scale_y_continuous(expand = c(0, 0))

teams_data |>
  ggplot(aes(x = result, y = teamkills, fill = result)) +
  geom_boxplot(fill = c("navy", "maroon")) +
  scale_y_continuous(expand = c(0, 0))

We can clearly see the difference in total gold and team kills between victories and defeats. That seemed like a great place to start the logistic regression.
log_reg <- logistic_reg() |>
  set_engine("glm")

result_mod <- log_reg |>
  fit(result ~ teamkills + teamdeaths + gamelength + totalgold + dragons + barons + wardskilled + visionscore + leaguetype, data = team_train, family = "binomial")
tidy(result_mod) %>%
  knitr::kable(digits = 3)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -1.060 | 0.817 | -1.298 | 0.194 |
| teamkills | 0.268 | 0.036 | 7.478 | 0.000 |
| teamdeaths | -0.503 | 0.033 | -15.342 | 0.000 |
| gamelength | -0.020 | 0.002 | -11.707 | 0.000 |
| totalgold | 0.001 | 0.000 | 12.284 | 0.000 |
| dragons | 0.219 | 0.112 | 1.962 | 0.050 |
| barons | -0.233 | 0.179 | -1.298 | 0.194 |
| wardskilled | 0.015 | 0.010 | 1.472 | 0.141 |
| visionscore | -0.008 | 0.004 | -1.908 | 0.056 |
| leaguetypeInternational | 0.696 | 0.556 | 1.253 | 0.210 |
I used trial and error here to remove variables with really high p-values. Some variables did not have high p-values themselves, but keeping them in the model inflated the p-value of the intercept, so I removed those from the model as well. After a lot of trial and error, I arrived at the model below; an automated alternative is sketched first.
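This sketch was not the procedure actually used here; it shows how backward stepwise selection by AIC could automate the same pruning, starting from the full glm.

# Backward stepwise selection by AIC, starting from the full model
full_glm <- glm(result ~ teamkills + teamdeaths + gamelength + totalgold +
                  dragons + barons + wardskilled + visionscore + leaguetype,
                data = team_train, family = "binomial")
step(full_glm, direction = "backward")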
result_mod1 <- log_reg |>
  fit(result ~ teamkills + wardskilled + visionscore, data = team_train, family = "binomial")
glance(result_mod1)

# A tibble: 1 × 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 5664. 4085 -1556. 3119. 3144. 3111. 4082 4086
tidy(result_mod1) %>%
  knitr::kable(digits = 3)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -4.242 | 0.204 | -20.785 | 0 |
| teamkills | 0.372 | 0.011 | 32.927 | 0 |
| wardskilled | 0.018 | 0.004 | 4.324 | 0 |
| visionscore | -0.006 | 0.001 | -5.290 | 0 |
\[ \log\left(\frac{\hat{p}}{1-\hat{p}}\right) = -4.242 + 0.372 \times (teamkills) + 0.018 \times (wardskilled) - 0.006 \times (visionscore) \]
This shows that for each extra team kill in a game, we expect the log odds of the team winning to increase by 0.372, assuming no other variables change.
\[ \text{odds} = e^{\log(\text{odds})} \]
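For example, exponentiating the teamkills coefficient converts it into an odds ratio: \[ e^{0.372} \approx 1.45 \] so each additional team kill multiplies the odds of winning by roughly 1.45, holding the other variables constant.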
# Store residuals and create a row-number variable
result_aug <- augment(extract_fit_engine(result_mod1), newdata = team_train, type.predict = "response",
                      type.residuals = "deviance") |>
  mutate(id = row_number())

# Plot residuals vs fitted values
ggplot(data = result_aug, aes(x = .fitted, y = .resid, color = result)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Fitted values",
       y = "Deviance residuals",
       title = "Deviance residuals vs. fitted")

# Plot residuals vs row number
ggplot(data = result_aug, aes(x = id, y = .resid, color = result)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "id",
       y = "Deviance residuals",
       title = "Deviance residuals vs. id")

The residual analysis shows a good spread of residuals against the id column, which tells us the variability in the residuals is there, and there are not many problems with extreme values. There is a clear gap between the victory and defeat points in the fitted-values plot, as expected for this kind of dataset.
Prediction Accuracy
result_glm <- glm(result ~ teamkills + wardskilled + visionscore, data = team_train, family = "binomial")
pred_prob <- predict(result_glm, newdata = team_test, type = "response")
pred_result <- ifelse(pred_prob > 0.5, "Victory", "Defeat")
table(pred_result)

pred_result
Defeat Victory
902 850
mean(team_train$result == pred_result)

Warning in `==.default`(team_train$result, pred_result): longer object length
is not a multiple of shorter object length

Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
shorter object length

[1] 0.5073421
I checked the accuracy here as well, and roughly 50% would be very concerning, since it would mean the model is barely better than guessing the winner. However, the warnings above reveal a bug: the predictions (made on team_test) were compared against team_train$result, which has a different length, so R recycled the shorter vector and the 50% figure is an artifact of the mismatch rather than a true test accuracy. The comparison should be made against the test-set labels, as sketched below.
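A minimal correction sketch (its output is not shown, since this comparison was not part of the original run):

# Compare test-set labels against test-set predictions
mean(team_test$result == pred_result)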
position_glm <- glm(position ~ kills + deaths + assists + totalgold + visionscore + damagetochampions + total.cs, data = player_train, family = "binomial")
position_glm
Call: glm(formula = position ~ kills + deaths + assists + totalgold +
visionscore + damagetochampions + total.cs, family = "binomial",
data = player_train)
Coefficients:
(Intercept) kills deaths assists
1.753e+00 6.768e-02 1.522e-01 1.926e-01
totalgold visionscore damagetochampions total.cs
-6.482e-07 4.283e-02 -1.671e-04 -1.637e-02
Degrees of Freedom: 20432 Total (i.e. Null); 20425 Residual
Null Deviance: 27510
Residual Deviance: 11010 AIC: 11020
pred_prob <- predict(position_glm, newdata = player_test, type = "response")
pred_position <- ifelse(pred_prob > 0.5, "Non-Laner", "Laner")
mean(player_test$position == pred_position)

[1] 0.9086445
I tried a logistic regression model with the same predictors that I used for quadratic discriminant analysis to determine whether a player was a Laner or not. It is really interesting to see that the logistic regression model performs better than the QDA and LDA models above.
This logistic regression model shows that the log odds of a player being a "Non-Laner" increase by about 0.068 with each extra kill, holding everything else constant.
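As a quick check on that interpretation (a sketch, not part of the original run), exponentiating the coefficients turns the log odds into odds ratios; e^0.068 is about 1.07, so each extra kill multiplies the odds of being a Non-Laner by roughly 1.07.

# Odds ratios for the position model
exp(coef(position_glm))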
Multinomial Logistic Regression
After seeing how good the logistic regression model was at identifying whether a player was a "Laner" or "Non-Laner", I wanted to see if a multinomial logistic regression model could identify which role a player has: "Top Lane", "Middle Lane", "Bottom Lane", "Jungle", or "Support".
multi_role_mod <- multinom_reg() |>
  set_engine("nnet") |>
  fit(role ~ kills + deaths + assists + totalgold + visionscore + damagetochampions + total.cs, data = player_train)
tidy(multi_role_mod) |>
  print(n = Inf)

# A tibble: 32 × 6
y.level term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Jungle (Intercept) 3.06 0.00000428 714664. 0
2 Jungle kills -0.0854 0.000139 -614. 0
3 Jungle deaths 0.219 0.0000388 5654. 0
4 Jungle assists 0.210 0.000173 1213. 0
5 Jungle totalgold 0.000263 0.0000224 11.8 6.78e- 32
6 Jungle visionscore 0.0109 0.00175 6.24 4.31e- 10
7 Jungle damagetochampions -0.000184 0.00000549 -33.4 4.33e-245
8 Jungle total.cs -0.0225 0.000861 -26.1 4.19e-150
9 Middle Lane (Intercept) -0.0399 0.00000145 -27467. 0
10 Middle Lane kills -0.0104 0.0000817 -127. 0
11 Middle Lane deaths 0.0477 0.0000184 2589. 0
12 Middle Lane assists 0.0961 0.0000768 1251. 0
13 Middle Lane totalgold -0.000185 0.0000171 -10.8 2.04e- 27
14 Middle Lane visionscore -0.0240 0.00167 -14.3 1.50e- 46
15 Middle Lane damagetochampions 0.000000158 0.00000335 0.0473 9.62e- 1
16 Middle Lane total.cs 0.0105 0.000666 15.8 1.77e- 56
17 Support (Intercept) 5.01 0.00000251 1999788. 0
18 Support kills -0.287 0.0000656 -4367. 0
19 Support deaths 0.289 0.0000557 5196. 0
20 Support assists 0.298 0.000303 980. 0
21 Support totalgold 0.000172 0.0000431 3.98 6.82e- 5
22 Support visionscore 0.0404 0.00273 14.8 1.65e- 49
23 Support damagetochampions -0.000385 0.0000130 -29.7 3.32e-194
24 Support total.cs -0.0416 0.00168 -24.8 2.36e-135
25 Top Lane (Intercept) 1.27 0.0000100 126657. 0
26 Top Lane kills -0.408 0.000140 -2909. 0
27 Top Lane deaths 0.129 0.0000853 1513. 0
28 Top Lane assists -0.0125 0.000132 -94.6 0
29 Top Lane totalgold 0.000818 0.0000190 43.2 0
30 Top Lane visionscore -0.0746 0.00204 -36.6 5.13e-293
31 Top Lane damagetochampions -0.0000663 0.00000409 -16.2 4.83e- 59
32 Top Lane total.cs -0.0247 0.000721 -34.3 2.72e-257
In multinomial logistic regression we have one equation per outcome category, each compared against a baseline category (here, Bottom Lane). Taking Jungle as an example, the model gives the log odds of a player being in the Jungle role relative to Bottom Lane. \[ \begin{align*} \log\left(\frac{\hat{p}_{\text{Jungle}}}{\hat{p}_{\text{Bottom Lane}}}\right) &= 3.06 \\ &- 0.0854 \times \text{kills} \\ &+ 0.219 \times \text{deaths} \\ &+ 0.210 \times \text{assists} \\ &+ 0.000263 \times \text{totalgold} \\ &+ 0.0109 \times \text{visionscore} \\ &- 0.000184 \times \text{damagetochampions} \\ &- 0.0225 \times \text{total.cs} \end{align*} \]
Based on this, we can say that if a player's kills increase by one, then the log odds of that player being in the Jungle role (relative to Bottom Lane) decrease by 0.0854, holding everything else constant.
Model Performance
role_aug <- augment(multi_role_mod, new_data = player_test)
role_aug

# A tibble: 8,757 × 20
.pred_class `.pred_Bottom Lane` .pred_Jungle `.pred_Middle Lane`
<fct> <dbl> <dbl> <dbl>
1 Top Lane 0.200 0.0550 0.360
2 Top Lane 0.188 0.0268 0.263
3 Top Lane 0.0995 0.0461 0.236
4 Bottom Lane 0.315 0.127 0.294
5 Support 0.000228 0.0834 0.00000641
6 Bottom Lane 0.368 0.0986 0.312
7 Jungle 0.00411 0.644 0.00616
8 Top Lane 0.239 0.136 0.288
9 Top Lane 0.163 0.0232 0.263
10 Bottom Lane 0.413 0.115 0.349
# ℹ 8,747 more rows
# ℹ 16 more variables: .pred_Support <dbl>, `.pred_Top Lane` <dbl>,
# teamname <chr>, champion <chr>, playername <chr>, kills <int>,
# deaths <int>, assists <int>, wardskilled <int>, wardsplaced <int>,
# visionscore <int>, position <fct>, total.cs <int>, totalgold <int>,
# damagetochampions <int>, role <fct>
table(role_aug$.pred_class, role_aug$role)
Bottom Lane Jungle Middle Lane Support Top Lane
Bottom Lane 534 214 485 1 211
Jungle 147 1424 67 96 205
Middle Lane 703 26 718 2 254
Support 41 11 0 1581 0
Top Lane 392 111 477 29 1028
role_aug |>
  ggplot(aes(x = role, fill = .pred_class)) +
  geom_bar()

The graph here shows that Support and Jungle were the easiest to tell apart, while the laning positions were more likely to be confused with one another. This also explains why our earlier two-category model performed so well: the laning positions are the hardest to distinguish, and that is where most of the inaccurate predictions come from. Looking at Bottom Lane, we can see it is more likely to be predicted as Middle Lane than as Bottom Lane, and Middle Lane is often predicted as other lanes too. All the laning positions are quite similar; Top Lane may be the most distinct of the laning roles, while Middle Lane and Bottom Lane are extremely close to one another in playstyle.
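To quantify that impression (a sketch, not part of the original analysis), per-class accuracy can be read off the confusion matrix: the diagonal divided by the column totals gives the share of each true role predicted correctly.

# Share of each true role predicted correctly
conf_mat <- table(role_aug$.pred_class, role_aug$role)
diag(conf_mat) / colSums(conf_mat)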
mean(role_aug$.pred_class == role_aug$role)

[1] 0.6035172
This model performs slightly worse than the discriminant analysis model. The accuracy here is 60.35%, which is quite high considering the number of classes we are trying to predict. It is interesting that for the two-category outcome logistic regression worked better than QDA, but for five categories the QDA model was better. The probability of guessing a player's role correctly with no information is around 20%; after applying statistical modeling and analyzing all the data, we increase that probability by roughly 40 percentage points. Before statistics, we only have a sample of outcomes and every outcome seems possible and equally likely; with statistics, we can calculate better odds and probabilities and get closer to making an accurate judgement.
Probability as a Foundation of Statistics (Objective 1)
I have learned from this course how heavily statistics depends on probability. The biggest example of that is the 95% confidence interval: when hypothesis testing is performed, we are usually only 95% sure of the outcome. As statisticians, we deal with uncertainty. If we could find the true average of a population, there would be no need for statistics. Since that is not possible, our modeling can only make us more and more certain, but some amount of uncertainty always remains. We manage it by creating multiple models, testing them constantly, increasing the sample size, and so on. I implemented bootstrapping in my model to help create better estimates. Statistics and statistical modeling allow us to be more informed about randomness and unpredictability so that we can be more accurate with our guesses. You will notice that I have discussed this from time to time during this project.
Application of linear models (Objective 2)
At the beginning of this course, my idea of statistical modeling was limited to simple linear regression, and the idea of multiple linear regression seemed quite jarring. In this project you can see the application of linear and polynomial models, as well as categorical variables included in those models. In this class the focus was not just on the adjusted R-squared value but on thorough testing and looking at the data to come up with the best solution. I have learned not just to look at the numbers in the output but to understand the output and make choices based on it. The numbers provide a great starting point, but after that we have to perform the analysis on our own.
Model Selection (Objective 3)
Throughout this project you will see my chosen model alongside at least one other model used for comparison. It is important to have multiple models to choose from. You will see in my work that I focus on simplicity and try to find a model that explains the most variability without being overly complex. A simpler model reduces the chance of overfitting and can therefore be more universally applicable. I have learned to use tools like PCR, which are really helpful for giving me a starting point, but I try not to over-rely on them. You will see how I used PCR to get an idea of the models I wanted, then switched to other approaches to build models based on my preference.
Presenting the Models (Objective 4)
Presenting models to the general public is a crucial part of a statistician's role, and it is a skill I have been improving over time. As someone studying statistics, it is easy to forget that people might not understand the terms being discussed, so I have tried to explain my final models wherever I can. I believe there is still a lot more work I can do on that, because it is a lifelong learning experience.
Fitting and Assessing models (Objective 5)
Throughout this portfolio you will see me fitting models and using them to predict data. I have used training and testing data throughout and tried to assess every model I worked on. Model assessment is important not just for finding out whether a model is good, but also for managing our expectations of it. It also lets us see how the variables interact with each other and whether there is a better approach we could take. I hope you will find my application of model assessment satisfactory.