League of Legends is a multiplayer online battle arena (MOBA). The objective of the game is to destroy the enemy base, formally known as the enemy Nexus; the first team to do so wins. This is easier said than done, because a series of 11 fortified turrets defends the Nexus (at least 6 must be destroyed to reach it). Winning involves many intricacies, such as farming gold by killing enemy players, enemy minions, and neutral monsters. Gold is a precious resource: it lets players purchase powerful items, which enable them to dominate the enemy team.
In this project, I seek to predict whether or not the blue team (the team that starts on the bottom-left side of the map) will win its match.
The red team, on the opposite side of the map, has its own data. For simplicity’s sake, I’ve decided to focus only on blue-side data. I think it would be interesting to extend the analysis of this data set to use both blue and red team data.
## 20 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 3.199985e-01
## blueWardsPlaced -2.471596e-04
## blueWardsDestroyed .
## blueFirstBlood 1.896893e-02
## blueKills .
## blueDeaths -5.198725e-03
## blueAssists .
## blueEliteMonsters 1.305330e-02
## blueDragons 8.313129e-02
## blueHeralds .
## blueTowersDestroyed -5.144206e-02
## blueTotalGold 1.230784e-05
## blueAvgLevel .
## blueTotalExperience .
## blueTotalMinionsKilled -1.157570e-05
## blueTotalJungleMinionsKilled 6.369299e-04
## blueGoldDiff 6.074397e-05
## blueExperienceDiff 4.085287e-05
## blueCSPerMin -2.859989e-03
## blueGoldPerMin 1.178439e-06
Above are the coefficients of a cross-validated LASSO regression model. The purpose was to eliminate “unimportant” variables and make the training predictor matrix smaller and simpler.
LASSO shrank the coefficients of blueWardsDestroyed, blueKills, blueAssists, blueHeralds, blueAvgLevel, and blueTotalExperience to exactly zero (shown as “.” above), deeming them unimportant. We will continue to build our predictive models using the subset of the data that excludes these variables.
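The selection step can be sketched as follows. This is a minimal Python illustration, not the code used to fit the model: it transcribes the coefficient table above into a dictionary (the name `lasso_coefs` is an assumption, with `None` standing in for each “.” entry) and keeps only the predictors whose coefficient is nonzero.

```python
# Coefficients transcribed from the LASSO output above (intercept omitted).
# None marks a "." entry, i.e. a coefficient shrunk to exactly zero.
lasso_coefs = {
    "blueWardsPlaced": -2.471596e-04,
    "blueWardsDestroyed": None,
    "blueFirstBlood": 1.896893e-02,
    "blueKills": None,
    "blueDeaths": -5.198725e-03,
    "blueAssists": None,
    "blueEliteMonsters": 1.305330e-02,
    "blueDragons": 8.313129e-02,
    "blueHeralds": None,
    "blueTowersDestroyed": -5.144206e-02,
    "blueTotalGold": 1.230784e-05,
    "blueAvgLevel": None,
    "blueTotalExperience": None,
    "blueTotalMinionsKilled": -1.157570e-05,
    "blueTotalJungleMinionsKilled": 6.369299e-04,
    "blueGoldDiff": 6.074397e-05,
    "blueExperienceDiff": 4.085287e-05,
    "blueCSPerMin": -2.859989e-03,
    "blueGoldPerMin": 1.178439e-06,
}

# Predictors that survive selection, and those that are dropped.
kept = [name for name, coef in lasso_coefs.items() if coef is not None]
dropped = [name for name, coef in lasso_coefs.items() if coef is None]

print(f"kept {len(kept)} predictors, dropped {len(dropped)}")
print("dropped:", ", ".join(dropped))
```

The 13 names in `kept` are the predictors passed on to the tree-based models that follow.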
We begin with gradient boosted machines!
## var rel.inf
## blueGoldDiff blueGoldDiff 27.9031795
## blueExperienceDiff blueExperienceDiff 20.5554290
## blueTotalGold blueTotalGold 18.8782903
## blueTotalMinionsKilled blueTotalMinionsKilled 10.0735868
## blueTotalJungleMinionsKilled blueTotalJungleMinionsKilled 7.6554363
## blueWardsPlaced blueWardsPlaced 7.5060410
## blueDeaths blueDeaths 4.0363179
## blueEliteMonsters blueEliteMonsters 1.2535946
## blueDragons blueDragons 1.1982501
## blueFirstBlood blueFirstBlood 0.8196457
## blueTowersDestroyed blueTowersDestroyed 0.1202287
## blueCSPerMin blueCSPerMin 0.0000000
## blueGoldPerMin blueGoldPerMin 0.0000000
## [1] "Boost MSE:" "0.225937999714031"
Here we see that the GBM model suggests that gold difference (blueGoldDiff) is the most important factor in predicting a win for the blue team. This makes a lot of sense: apart from blueExperienceDiff, which ranks second, it is the only variable the model has that compares the blue team to the red team. Moreover, in the actual game of League of Legends, deaths, minion kills, and monster kills all affect a team’s gold total.
The more gold a team has, the more buying power it has. High buying power lets players purchase strong items that they can use in combat to kill even more players and minions. It makes sense that the model has concluded this as well, since gold difference is a direct comparison between the red and blue teams: if one team has more gold than the other, it’s quite likely that the richer team wins.
Let’s see what a Random Forest model has to say about the data!
A bagging model is simply a random forest with m = p, meaning that every predictor is considered as a candidate at each split. In this case we set mtry to the number of columns in our predictor matrix.
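To make the distinction concrete, here is a small Python sketch of how the number of candidate predictors per split differs between the two approaches. The function name `candidates_per_split` is hypothetical, and the random forest value assumes the common classification default of floor(sqrt(p)), which may not be the setting used in this report.

```python
import math

def candidates_per_split(p: int, method: str) -> int:
    """Number of predictors sampled as split candidates at each tree node."""
    if method == "bagging":
        # Bagging: every predictor is available at every split (m = p).
        return p
    if method == "random_forest":
        # Assumed default for classification forests: floor(sqrt(p)).
        return max(1, math.isqrt(p))
    raise ValueError(f"unknown method: {method}")

p = 13  # number of predictors with a nonzero LASSO coefficient
print("bagging:", candidates_per_split(p, "bagging"))
print("random forest:", candidates_per_split(p, "random_forest"))
```

Restricting each split to a random subset of predictors decorrelates the trees, which is the only difference between the two models here.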
## 0 1 class.error
## 0 2391 933 0.2806859
## 1 981 2281 0.3007357
## [1] "Bagging MSE:" "1.26450045551169"
## [1] "Random Forest MSE:" "1.26935924688734"
Here we see that the forest model places a lot of importance on gold difference. The predictors that follow gold difference are all highly correlated with gold itself. Again, deaths, minion kills, and experience difference are all tied to killing others and obtaining gold, so this matches what is observed in-game.
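The confusion matrix printed above can also be summarized as an overall accuracy. The following Python sketch simply transcribes the four counts and assumes the usual layout in which rows are the true class and columns the predicted class; it is an illustration, not the code that produced the output.

```python
# Counts transcribed from the confusion matrix above.
# Assumed layout: rows = true class (0, 1), columns = predicted class (0, 1).
confusion = [
    [2391, 933],
    [981, 2281],
]

total = sum(sum(row) for row in confusion)
correct = confusion[0][0] + confusion[1][1]
accuracy = correct / total

# Per-class error: share of each true class that was misclassified.
class_error = [1 - confusion[i][i] / sum(confusion[i]) for i in range(2)]

print(f"accuracy: {accuracy:.4f}")
print("class errors:", [round(e, 7) for e in class_error])
```

The per-class errors reproduce the `class.error` column, and the overall accuracy comes out at roughly 71%.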
The last model that we are going to try with this data set is a Bayesian Additive Regression Tree (BART) model.
## [1] "Bayesian Additive Regression Tree MSE:"
## [2] "1.27339180254668"
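For the BART model, we get an MSE of 1.2734. This is slightly worse than bagging (1.2645) and random forest (1.2694), although all three are about equally good. Boosting was the best, with an MSE of 0.2259. One caveat: an MSE above 1 cannot occur when both the labels and the predictions lie in [0, 1], so the bagging, random forest, and BART figures were probably computed on a different label coding than the boosting figure, and the comparison should be read with caution.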
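As a toy illustration of how an MSE above 1 can arise for a binary outcome: with 0/1 labels and predictions in [0, 1], every squared error is at most 1, so the mean cannot exceed 1. One possible explanation, which is an assumption since the report’s code is not shown, is that the labels were coded 1/2 (as happens when an R factor is converted to numeric) while the predictions stayed on the 0/1 scale. The Python sketch below uses made-up labels and probabilities, not values from the data set.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy values for illustration only; not taken from the data set.
labels_01 = [0, 1, 1, 0]
probs = [0.2, 0.7, 0.6, 0.4]

# Hypothetical miscoding: the same labels shifted to 1/2.
labels_12 = [y + 1 for y in labels_01]

print("MSE with 0/1 labels:", round(mse(labels_01, probs), 4))
print("MSE with 1/2 labels:", round(mse(labels_12, probs), 4))
```

Shifting the labels by 1 adds 1 + 2 × mean(label − prediction) to the MSE, so a well-calibrated model would move from roughly 0.2 to roughly 1.2, which is the same order as the figures reported above.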
In this report, I used 10-fold cross-validated LASSO regression for variable selection. From this, I built boosting, bagging, random forest, and BART models to explore the predictors for winning a League of Legends game.
The LASSO model did a good job of picking the right variables. As an avid League of Legends player, I would say that these “important” variables are definitely representative of my in-game experience. There are other predictors that the data does not capture, which is largely a limitation of the data set itself. I think that the analysis would be better if there were more “difference” metrics. For example, the gold difference variable shows the difference in power between the two teams: if one team has more gold than the other, it is in a better position to win the game. Similarly, if one team has placed more wards than the other, it may be more likely to win the game (this is only a hypothesis).
My main hypothesis is that difference quantities such as (blueKills - redKills) would provide much better classification results. Generally, the models above reached roughly 70% accuracy, which is about average across other coding projects that use this data set. Again, I think that difference quantities between the blue and red teams would yield a higher classification accuracy.
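A minimal Python sketch of how such difference quantities could be built is shown below. The helper name `add_difference_features` and the sample record are hypothetical, and the sketch assumes that each blue column has a red counterpart with the same suffix (for example blueKills and redKills).

```python
def add_difference_features(row, stats):
    """Return a copy of `row` with blue-minus-red difference columns added."""
    out = dict(row)
    for stat in stats:
        # Naming follows the existing blueGoldDiff convention.
        out[f"blue{stat}Diff"] = row[f"blue{stat}"] - row[f"red{stat}"]
    return out

# Hypothetical match record; the values are made up for illustration.
match = {
    "blueKills": 9,
    "redKills": 6,
    "blueWardsPlaced": 28,
    "redWardsPlaced": 15,
}

features = add_difference_features(match, ["Kills", "WardsPlaced"])
print(features["blueKillsDiff"], features["blueWardsPlacedDiff"])
```

Each new column is positive when the blue team leads and negative when it trails, which gives the models the same kind of direct comparison that blueGoldDiff already provides.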
Across these models, the most important variable was gold difference, and the variables that followed it were in some way related to gold accumulation. Overall, the best model for this classification task, as measured by mean squared error, was the boosting model.