Capstone Dota2 Summary

Introduction

The aim of this report is to provide new players to Dota 2 some starting strategies that will help them progress and win matches. The analysis used a Dota 2 dataset of 50,000 matches with data collected on all 10 players per match, leading to 500,000 separate observations. It has identified important variables that influence the amount of gold a player finishes with and has made some key observations with regards to winning matches.

What is Dota 2?

Dota 2 (Defence Of The Ancients 2) is a popular free-to-play multiplayer online battle arena (MOBA) that consists of two teams of five players against each other who fight for two sides; the Radiant and the Dire. The objective of the game is for each team to gain access to the enemy base and destroy their large structure called an ‘Ancient’. Each base is linked by three paths commonly referred to as ‘top, middle or lower lanes’. Each lane has three towers and two barracks. The towers act as defensive structures that automatically attack enemy units and the barracks act as spawn locations for AI-controlled units called ‘creeps’ that go out in waves along their corresponding lanes. These will automatically attack enemy units and structures they encounter in their lanes. The general objective for players is to escort their creep towards the enemy base, help them destroy enemy defence towers, gain access to their base and destroy the enemy ‘Ancient’ structure.

Dota 2 mini map at start of the game where cubes (11/side) represent tower positions for Radiant (green) and Dire (red), smaller double-cubes (6/side)represent barracks location and non-cube (1/side) represents ‘Ancient’ location

The Dataset

Kaggle user Devin posted an excellent dataset of Dota 2 matches to https://www.kaggle.com/devinanzelmo/dota-2-matches. Here data was collected on a wide range of different variables from which heroes are picked, to the kills, deaths and assists they accrue to gold and xp accumulation for each match. This project focused on two main files; match.csv and players.csv. The file Hero_Names.csv was modified and used to support the work conducted on the match and players files.

The key variables within match.csv and players.csv used for this project are listed below. These two files were linked by the variable ‘hero_id’ into the data frame ‘CombinedDF’ and exported to ‘CombinedDF.csv’.

match.csv

Variable	Description
match_id	unique integer to identify matches across all files
start_time	match start time in seconds since 00:00, 01/01/1970, UTC
duration	length of match in seconds
tower_status_radiant	total health of all Radiant towers at match end
tower_status_dire	total health of all Dire towers at match end
barracks_status_dire	total health of all Dire barracks at match end
barracks_status_radiant	total health of all Radiant barracks at match end
first_blood_time	timestamp in seconds of when first player is killed from match start
game_mode	an integer indicating which game mode the match is playing
radiant_win	a factor indicating if the radiant team won or lost the match

players.csv

Variables	Descriptions
match_id	unique integer to identify matches across all files
account_id	unique integer to identify players. 0 (zero) represent anonymous players
hero_id	unique integer to identify which hero was selected per player
player_slot	unique integer where 0-4 represent players on Radiant side and 128-132 the Dire side
gold	amount of gold player has at match end
gold_spent	amount of gold player spent during the match
gold_per_min	amount of gold per minute a player accumulated on average during the match
xp_per_min	amount of experience points per minute a player accumulated on average during the match
kills	total kills a player scored per match
deaths	number of times a player died per match
assists	number of assists a player gained per match
denies	number of denies a player performed per match
last_hit	number of last hits a player landed on creep per match
hero_damage	cumulative amount of damage a player caused to enemy heroes
tower_damage	cumulative amount of damage a player caused to enemy towers
level	the hero character level a player obtained by match end
leaver_status	integer indicating if a player stayed until match end or left early
unit_order_total	the total number of commands a player conducted in a match with keyboard and mouse

Data Wrangling

While initially examining CombinedDF data frame it quickly became clear that some of these matches were not played at a competitive level. It is expected that these matches would collect spurious data and were removed. For example while looking at the histogram of the variable ‘duration’ it was seen that the longest match recorded went on for over 4 hours. Research online on Dota 2 forums suggested that typically matches would last between 25-60 minutes each.

The dataset was narrowed down to matches that last between 15-75 minutes to reduce the number of non-competitive matches in the analysis. This filter also managed to remove most matches that would have been filtered by other means. In the variable ‘hero_id’ where the value was 0 indicated that no hero was selected. This led to teams being unbalanced in numbers at the very start and not a true 5v5 match. Most of the matches with ‘hero_id’ 0 present had already been removed by the 15-75 minute filter of ‘duration’. This was also true for matches where ‘leaver_status’ was not equal to 0. A ‘leaver_status’ of 1-4 indicated that at some point someone left the match, again creating imbalance in those matches. Most of these had also been removed by the ‘duration’ filter.

Finally the file ‘hero_names.csv’ was modified by adding each hero’s class and their roles they are associated with, and then added to the final ‘CombinedDF’ file. This file was the working dataset going forward with 420,510 separate observations across 51 different variables.

Models and Conclusions

When modeling was first performed it was applied to the whole dataset which produced models with enormous variance and noise. It was then decided to look at only one hero for modeling. The following plot shows how often each hero was selected, ordered by popularity and coloured by their win:loss ratio (win/loss).

this plot shows hero popularity coloured by their win:loss ratio where 1 (white) means they won and lost in equal amounts, more than 1 (red) means they won more than they lost and less than 1 (blue) means they lost more than they won. The closer to 1, the lighter the red or blues shading. The numbers in each bar represent the number of different roles each hero is categorized with.

The hero Windranger, the hero selected most often in the dataset, was used for all regression modeling going forward. This was a good choice because Windranger’s Win:Loss ratio is close to 1 and has been categorized for 5 different roles, making it more of a ‘generalist’ hero who can play a range of different roles in the game.

Modelling gold with linear regression

A key component in Dota 2 is gold which allows players to buy and upgrade equipment to make them perform better in a match and is a good proxy for being successful in the game. Gold was compared to several other variables in several multiple linear regression models to identify how influential each independent variable was upon the dependent gold variable. The most important variables are summarised in the table below.

Influential.Variables	Impact
duration	The length of matches will significantly impact the amount of gold a hero will finish with
gold_per_min	the rate of gold a hero gathers through the match
Win	Playing for a winning team will most likely produce more gold
gold_spent	Spending gold will significantly impact how much a hero finishes with
deaths	When a hero dies they lose an amount of gold
xp_per_min	The amount of gold lost upon death is determined by their level. Level is a discrete variable so continuous variable ‘xp_per_min’ is used to represent level
hero_damage	Killing and assisting in killing enemy heroes provide gold. The continuous variable ‘hero_damage’ represents these two discrete variables

The best model identified is as follows.

## 
## Call:
## lm(formula = gold ~ duration * gold_per_min + Win + gold_spent + 
##     deaths * xp_per_min + hero_damage, data = Windranger)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4652.6  -537.5   -30.3   482.4  7561.6 
## 
## Coefficients:
##                         Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)            1.890e+03  1.267e+02   14.923  < 2e-16 ***
## duration              -1.141e+00  5.225e-02  -21.837  < 2e-16 ***
## gold_per_min          -4.307e+00  3.089e-01  -13.946  < 2e-16 ***
## Win                    8.938e+02  1.982e+01   45.100  < 2e-16 ***
## gold_spent            -5.766e-01  3.974e-03 -145.095  < 2e-16 ***
## deaths                 2.520e+01  8.426e+00    2.991  0.00278 ** 
## xp_per_min             2.961e+00  1.646e-01   17.996  < 2e-16 ***
## hero_damage           -1.836e-02  2.059e-03   -8.918  < 2e-16 ***
## duration:gold_per_min  1.197e-02  1.313e-04   91.130  < 2e-16 ***
## deaths:xp_per_min     -4.102e-01  1.720e-02  -23.847  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 904.4 on 17587 degrees of freedom
## Multiple R-squared:  0.7234, Adjusted R-squared:  0.7233 
## F-statistic:  5112 on 9 and 17587 DF,  p-value: < 2.2e-16

The following conclusions for acquiring gold are drawn from this model.

Winning matches has a very strong positive relationship to the amount of gold a player will finish with. To finish with a lot of gold winning is important.
Spending gold has a slight negative relationship with the amount of gold a player finishes with. This suggests that spending gold must be done wisely on equipment that will improve the heroes capability to go out and collect gold more efficiently.
Acquiring experience quickly (xp_per_min) is important to levelling up heroes fast. The higher the level a hero is the more impact they have and the more gold they can accumulate. The faster a hero can level, the more likely it is they will be a higher level than the enemy heroes, increasing their chances of winning fights between them.
Avoiding deaths in a match is important to keeping gold. Gold is lost every time a hero dies. The amount of gold a hero loses is tied to the level they at the time of a death. Therefore dying later in a match when hero levels are higher has much more impact. Avoid deaths, especially towards the end of matches.

Modelling wins by inference with logistic regression

Several variables were analysed in logistic regression models to infer their importance to winning Dota 2 matches. The following logistic regression model was made and found to be most suitable.

## 
## Call:
## glm(formula = Win ~ duration + gold_spent + gold_per_min + xp_per_min + 
##     deaths + assists + hero_damage + unit_order_total, family = binomial, 
##     data = TrainWR)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9825  -0.3332  -0.0413   0.2896   3.1462  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.789e+01  5.351e-01 -33.428  < 2e-16 ***
## duration          2.777e-03  1.609e-04  17.260  < 2e-16 ***
## gold_spent       -2.508e-04  2.143e-05 -11.708  < 2e-16 ***
## gold_per_min      6.058e-02  1.647e-03  36.778  < 2e-16 ***
## xp_per_min       -1.552e-02  7.336e-04 -21.156  < 2e-16 ***
## deaths           -2.547e-01  1.521e-02 -16.747  < 2e-16 ***
## assists           2.667e-01  1.020e-02  26.152  < 2e-16 ***
## hero_damage      -3.121e-04  1.209e-05 -25.805  < 2e-16 ***
## unit_order_total -1.309e-04  2.236e-05  -5.854 4.79e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11949.1  on 8632  degrees of freedom
## Residual deviance:  4538.6  on 8624  degrees of freedom
## AIC: 4556.6
## 
## Number of Fisher Scoring iterations: 7

In this model it is seen that the variables with the most impacting regression coefficients to winning are the combat variables. These are ‘assists’ which has a positive relationship to winning and ‘deaths’ which has a negative relationship to winning.

It was determined in previous models that getting kills in matches did not significantly contribute to winning. Here is the model with ‘kills’ included.

## 
## Call:
## glm(formula = Win ~ duration + gold_spent + gold_per_min + xp_per_min + 
##     deaths + assists + kills + hero_damage + unit_order_total, 
##     family = binomial, data = TrainWR)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9577  -0.3317  -0.0414   0.2912   3.1677  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.763e+01  5.566e-01 -31.668  < 2e-16 ***
## duration          2.753e-03  1.615e-04  17.046  < 2e-16 ***
## gold_spent       -2.517e-04  2.144e-05 -11.738  < 2e-16 ***
## gold_per_min      6.011e-02  1.669e-03  36.022  < 2e-16 ***
## xp_per_min       -1.558e-02  7.352e-04 -21.196  < 2e-16 ***
## deaths           -2.559e-01  1.523e-02 -16.799  < 2e-16 ***
## assists           2.708e-01  1.053e-02  25.708  < 2e-16 ***
## kills             2.631e-02  1.609e-02   1.635    0.102    
## hero_damage      -3.286e-04  1.586e-05 -20.720  < 2e-16 ***
## unit_order_total -1.299e-04  2.238e-05  -5.803  6.5e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11949.1  on 8632  degrees of freedom
## Residual deviance:  4535.9  on 8623  degrees of freedom
## AIC: 4555.9
## 
## Number of Fisher Scoring iterations: 7

In this model it can be seen that the variable ‘kills’ as a predictor is not significant to ‘wins’. This was initially surprising since a core aspect of Dota 2 is combat. However in the original model (Windranger.LogM) it is seen that getting assists is significant and positively impacts winning. This means that for players to increase their chances of winning they should contribute to fighting and help out their teammates in fights with enemy heroes, but should not worry about getting that final killing blow. They should not over-extend themselves just to get a kill as this can expose themselves to unnecessary risks of dying. Avoiding death is very important to being successful in Dota 2, as dying is shown to have a negative impact to winning as shown in the model.

The variable ‘duration’ shows a positive correlation to winning in the model. This implies that Windranger may have a slightly better chance of winning in matches that are longer.

This led to further analysis in the form of comparing duration for Windranger against the second most popular hero, Shadow Fiend. Shadow Fiend is a good selection here for comparison since it is a hero that is more designed to focus on combat, whereas Windranger can cover more roles and is more of a generalist.

The following compares Windranger’s and Shadow Fiend’s duration density plot split out by win / loss.

These density plots show a slight shift in the density curves for both heroes when defined by win and loss. Winning matches for Windranger tend to be slightly longer than losing matches. Shadow fiend shows the opposite. This indicates different types of styles of how players should play these heroes. Windranger slightly prefers a longer match and Shadow Fiend a slightly shorter match. Heroes similar to Shadow Fiend maybe more successful pushing for a quick win while Windranger may benefit from playing more defensively initially, looking to draw out the match. It may be preferable to compare duration against the global median match duration to see this more clearly, however due to time constraints this is now recommended for future work.

This analysis also indicates that further work can be done to understand how team compositions should be put together. Heroes who prefer shorter matches should get together and push for a fast match, and visa versa.

To conclude this logistic model has inferred that to win matches it is important to enter combat and assist in the killing of enemy heroes, but it is not necessarily important to score the killing blows. Avoiding death is also important as the time penalties incurred for dying puts the team on the back foot until that hero can re-join the fight.

It is noted that the length a match goes on for can impact different heroes. The study here has highlighted that Windranger may benefit from slightly longer matches while Shadow Fiend may benefit from slightly shorter matches. Further work is required to analyse all heroes, which could also have an impact on team compositions and team strategies based upon those compositions.

Future Work

The data frame ‘CombinedDF’ has a very large mix of diverse variables that could be used in many ways to explore the game Dota 2. This study has only looked at the Windranger hero in depth, the most selected hero within this dataset. Further work should look at all other heroes to identify how the different variables impact them. This could highlight different strategies that should be used depending on which hero is selected.

The brief look into Shadow Fiend’s duration when compared to Windranger brings in more questions such as what is a good team composition? How should this team play? Do they go for a quick win or should they go for a slow and steady win? One might imagine that a team consisting of only support heroes will struggle to win, as will teams with no support roles. How do different team compositions play out against each other?

The investigation into gold and how the impact of gold_spent raises questions about items. Devin’s original dataset includes a list of item bought throughout each match for each hero. In this study these details was excluded. However a detailed examination of this information might identify which items are more effective at improving a heroes chances of winning matches. It may also show that some items work better for different heroes.

It was noted in ‘Dota2_Data_Wrangling.Rmd’ document that an interesting overlap between the differences in tower health at match end between Radiant and Dire towers existed when compared to ‘Radiant_win’ status, which could be used to identify ‘strong wins’ and ‘close wins’ for each side. Further investigation of this aspect could distill further interesting facts about the way the game is played.

In the original dataset there were 33 separate columns dedicated to ‘unit_order_’ which were simply amalgamated into ‘unit_order_total’ for this study, however analysis of these individual variables may open interesting views on how to play, or how different heroes are played. This could lead to identifying best ways of playing Dota 2.

Further to this the dataset posted to Kaggle by devin contained many more csv files relating to these matches which could be brought into the study to further enrich future analysis. This study has only scratched the surface of the possible avenues of exploration here. It has been extremely interesting and I highly recommend this dataset for further analysis.

References

Key documents for this report are:

Dota2_Introduction.Rmd (http://rpubs.com/LMS/Dota2_Introduction)
Dota2_Data_Wrangling.Rmd (http://rpubs.com/LMS/Dota2_Wrangled)
Dota2_Analysis_Summary.Rmd (http://rpubs.com/LMS/Dota2_Analysis_Summary)

Important support documents with raw working are:

Dota2_Analysis_and_Results.Rmd
Dota2_Detailed_Analysis.Rmd

These documents can be found in the Github depository:

https://github.com/DMzMin/LMS_Dota2

Thanks

Thanks goes out to Devin for posting such a great dataset to Kaggle.com

Special thanks goes to Dr Guy Maskall for all his help, pointers and direction. There is no way I could have done this course without his help and support. Thanks to the springboard.com team who run this course in data science. Thank you!

Extra special thanks to my wife for her support and encouragement throughout my whole course.