Dota 2 Capstone

Understanding the effect of team selection, hero selection and item purchase on Dota 2 gameplay

Introduction

The Dota 2 dataset was downloaded from Kaggle. It is being used for the Springboard Foundations of Data Science course capstone project. The dataset consists of 18 CSV files covering 50,000 ranked Dota 2 matches.

Approaching the Capstone

The capstone project is approached by first identifying the problem statement. Next, the most important CSV files are identified. Data wrangling is done on the most important files first, followed by the other files. After this, exploratory data analysis and some basic statistical correlations are performed. Finally, regressions and machine learning algorithms are applied to the dataset.

Problem Statement

The goal of the capstone is to identify potential biases in the gameplay structure of Dota 2. By gameplay structure, we refer to the more than 100 heroes and items that can be picked by players. It is important that any game, whether a board game or a computer game, measures the true skill level of a player accurately. It is possible that some specific combinations of pre-game choices affect a player's chances of winning. If so, the game outcome is unlikely to be an accurate estimate of the player's true skill level.

Why and for whom is this problem statement important

Valve Corporation is the developer of Dota 2. For Valve Corporation, it is of utmost importance that its games are accurate predictors of player skill level. If players manage to find loopholes or shortcuts that allow them to win, this will have a negative effect on Valve's reputation. Ensuring fairness and accuracy of gameplay and game mechanics is very important for any game developer.

What specific questions must be answered, with respect to the dataset, to solve this problem statement

To solve this problem statement, we focus on heroes, combination of heroes in a team and inventory items.

At the start of a game, each player is allowed to choose a hero. Heroes are divided into three types: strength (STR), agility (AGI) and intelligence (INT). As the names suggest, each type of hero has certain advantages. We can try finding if certain types of heroes allow players to win more easily.

Many Dota games are 5 on 5 multiplayer matches. There are two teams: radiant and dire. Players are allowed to choose any hero from the list of 113 heroes. A hero already chosen by another player cannot be chosen again. So, all ten heroes in a 5 on 5 multiplayer game are unique. For a team, there are no restrictions on the types of heroes they can choose, other than the requirement that all heroes in the game must be unique. A team is free to choose any number of strength, agility and intelligence heroes.

Finally, each player can buy inventory items using the gold they get from killing creeps, destroying buildings and killing opponent heroes. The player and their hero can use this gold to buy items that improve the attack and defense ratings of the hero. Understanding if specific combinations of heroes and items give a player undue advantage can also help solve the problem statement.

Important CSV files

match_csv, player_ratings_csv and players_csv are the 3 most important CSV files.

match_csv contains high-level information about each match, including duration, time to first blood, the status of each team's towers and barracks at the end of the game, the geographical region where the game took place and the team that won (radiant or dire).

## 'data.frame':    50000 obs. of  13 variables:
##  $ match_id               : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ start_time             : int  1446750112 1446753078 1446764586 1446765723 1446796385 1446798766 1446800938 1446804030 1446819063 1446837251 ...
##  $ duration               : int  2375 2582 2716 3085 1887 1574 2124 2328 2002 2961 ...
##  $ tower_status_radiant   : int  1982 0 256 4 2047 2047 1972 2046 0 0 ...
##  $ tower_status_dire      : int  4 1846 1972 1924 0 4 0 0 1982 1972 ...
##  $ barracks_status_dire   : int  3 63 63 51 0 3 3 0 63 63 ...
##  $ barracks_status_radiant: int  63 0 48 3 63 63 63 63 0 0 ...
##  $ first_blood_time       : int  1 221 190 40 58 113 4 255 4 85 ...
##  $ game_mode              : int  22 22 22 22 22 22 22 22 22 22 ...
##  $ radiant_win            : Factor w/ 2 levels "False","True": 2 1 1 1 2 2 2 2 1 1 ...
##  $ negative_votes         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ positive_votes         : int  1 2 0 0 0 0 0 0 0 0 ...
##  $ cluster                : int  155 154 132 191 156 155 151 138 182 133 ...

player_ratings_csv has information about all the human players who played the 50,000 games. The information includes the player's account_id, the total number of games played, the total number of games in which they were part of the winning team, and the mean and standard deviation of player skill. The skill mean and standard deviation were calculated using Microsoft Research's TrueSkill algorithm, which is widely used in the gaming world. Valve Corporation is not involved in the way player skill is calculated, so the skill mean and standard deviation columns have to be cross-checked for accuracy.

## 'data.frame':    834226 obs. of  5 variables:
##  $ account_id     : int  236579 -343 -1217 -1227 -1284 308663 79749 -1985 -2160 26500 ...
##  $ total_wins     : int  14 1 1 1 0 1 21 0 8 26 ...
##  $ total_matches  : int  24 1 1 1 1 1 40 1 12 50 ...
##  $ trueskill_mu   : num  27.9 26.5 26.5 27.2 22.9 ...
##  $ trueskill_sigma: num  5.21 8.07 8.11 8.09 8.09 ...

players_csv has per-player information for each match: the account_id of the participating player, the hero_id of the hero they used, gold earned and spent, xp earned, deaths, kills, assists, stuns and last hits per hero. It also gives a detailed breakdown of gold earned from killing creeps, killing other heroes and destroying buildings.

## 'data.frame':    500000 obs. of  73 variables:
##  $ match_id                         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ account_id                       : int  0 1 0 2 3 4 0 5 0 6 ...
##  $ hero_id                          : int  86 51 83 11 67 106 102 46 7 73 ...
##  $ player_slot                      : int  0 1 2 3 4 128 129 130 131 132 ...
##  $ gold                             : int  3261 2954 110 1179 3307 476 317 2390 475 60 ...
##  $ gold_spent                       : int  10960 17760 12195 22505 23825 12285 10355 13395 5035 17550 ...
##  $ gold_per_min                     : int  347 494 350 599 613 397 303 452 189 496 ...
##  $ xp_per_min                       : int  362 659 385 605 762 524 369 517 223 456 ...
##  $ kills                            : int  9 13 0 8 20 5 4 4 1 1 ...
##  $ deaths                           : int  3 3 4 4 3 6 13 8 14 11 ...
##  $ assists                          : int  18 18 15 19 17 8 5 6 8 6 ...
##  $ denies                           : int  1 9 1 6 13 5 2 31 0 0 ...
##  $ last_hits                        : int  30 109 58 271 245 162 107 208 27 147 ...
##  $ stuns                            : Factor w/ 233910 levels "-0.000122041",..: 206433 221256 233910 233910 233910 233910 233910 233910 188583 176916 ...
##  $ hero_damage                      : int  8690 23747 4217 14832 33740 10725 15028 10230 4774 6398 ...
##  $ hero_healing                     : int  218 0 1595 2714 243 0 764 0 0 292 ...
##  $ tower_damage                     : int  143 423 399 6055 1833 112 0 2438 0 0 ...
##  $ item_0                           : int  180 46 48 63 114 145 50 41 36 63 ...
##  $ item_1                           : int  37 63 60 147 92 73 11 63 0 9 ...
##  $ item_2                           : int  73 119 59 154 147 149 102 36 0 116 ...
##  $ item_3                           : int  56 102 108 164 0 48 36 147 46 65 ...
##  $ item_4                           : int  108 24 65 79 137 212 185 168 0 229 ...
##  $ item_5                           : int  0 108 0 160 63 0 81 21 180 79 ...
##  $ level                            : int  16 22 17 21 24 19 16 19 12 18 ...
##  $ leaver_status                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ xp_hero                          : num  8840 14331 6692 8583 15814 ...
##  $ xp_creep                         : num  5440 8440 8112 14230 14325 ...
##  $ xp_roshan                        : num  NA 2683 NA 894 NA ...
##  $ xp_other                         : num  83 671 453 293 62 1 1 244 27 933 ...
##  $ gold_other                       : num  50 395 259 100 NA ...
##  $ gold_death                       : num  -957 -1137 -1436 -2156 -1437 ...
##  $ gold_buyback                     : num  NA NA -1015 NA -1056 ...
##  $ gold_abandon                     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ gold_sell                        : num  212 1650 NA 938 4194 ...
##  $ gold_destroying_structure        : num  3120 3299 3142 4714 3217 ...
##  $ gold_killing_heros               : num  5145 6676 2418 4104 7467 ...
##  $ gold_killing_creeps              : num  1087 4317 3697 10432 9220 ...
##  $ gold_killing_roshan              : num  400 937 400 400 400 NA NA NA NA NA ...
##  $ gold_killing_couriers            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ unit_order_none                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ unit_order_move_to_position      : num  4070 5894 7053 4712 3853 ...
##  $ unit_order_move_to_target        : num  1 214 3 133 7 166 63 11 55 2 ...
##  $ unit_order_attack_move           : num  25 165 132 163 7 76 100 214 5 105 ...
##  $ unit_order_attack_target         : num  416 1031 645 690 1173 ...
##  $ unit_order_cast_position         : num  51 98 36 9 31 196 13 122 68 64 ...
##  $ unit_order_cast_target           : num  144 39 160 15 84 3 173 NA 18 102 ...
##  $ unit_order_cast_target_tree      : num  3 4 20 7 8 5 14 3 9 19 ...
##  $ unit_order_cast_no_target        : num  71 439 373 406 198 96 168 506 71 124 ...
##  $ unit_order_cast_toggle           : num  NA NA NA NA NA 2 NA NA NA NA ...
##  $ unit_order_hold_position         : num  188 346 643 150 111 161 118 491 97 135 ...
##  $ unit_order_train_ability         : num  16 22 17 21 23 19 16 18 12 18 ...
##  $ unit_order_drop_item             : num  NA NA 5 NA 1 NA NA 2 1 2 ...
##  $ unit_order_give_item             : num  NA NA NA NA NA NA NA 3 2 NA ...
##  $ unit_order_pickup_item           : num  NA 12 7 1 NA 2 1 18 1 2 ...
##  $ unit_order_pickup_rune           : num  2 52 8 9 2 NA 1 18 1 26 ...
##  $ unit_order_purchase_item         : num  35 30 28 45 44 36 43 30 38 33 ...
##  $ unit_order_sell_item             : num  2 4 NA 7 6 3 3 1 NA 4 ...
##  $ unit_order_disassemble_item      : num  NA NA 1 NA NA NA NA NA NA NA ...
##  $ unit_order_move_item             : num  11 21 18 14 13 3 13 19 14 22 ...
##  $ unit_order_cast_toggle_auto      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ unit_order_stop                  : num  NA NA NA NA NA NA NA NA 21 NA ...
##  $ unit_order_taunt                 : logi  NA NA NA NA NA NA ...
##  $ unit_order_buyback               : num  NA NA 1 NA 1 2 NA NA 1 1 ...
##  $ unit_order_glyph                 : num  NA NA NA 1 3 NA 4 NA 1 1 ...
##  $ unit_order_eject_item_from_stash : num  NA NA NA NA NA NA 1 NA NA NA ...
##  $ unit_order_cast_rune             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ unit_order_ping_ability          : num  6 14 17 13 23 2 1 4 4 14 ...
##  $ unit_order_move_to_direction     : num  NA NA NA NA NA NA NA 110 NA NA ...
##  $ unit_order_patrol                : logi  NA NA NA NA NA NA ...
##  $ unit_order_vector_target_position: logi  NA NA NA NA NA NA ...
##  $ unit_order_radar                 : logi  NA NA NA NA NA NA ...
##  $ unit_order_set_item_combine_lock : logi  NA NA NA NA NA NA ...
##  $ unit_order_continue              : logi  NA NA NA NA NA NA ...

Data Wrangling

Identify most important csv files

The first part of data wrangling is to identify the most important CSV files out of the 18 CSV files. The 3 most important files have already been identified as match_csv, player_ratings_csv and players_csv.

Eliminate columns that do not add value to analysis

Histograms (hist()) were drawn for the game_mode, negative_votes and positive_votes columns of match_csv. They showed that two of these variables took only a single value each. While game_mode took two values, the column is still considered insignificant to our analysis due to the nature of game_mode. Since the three columns do not add value to our analysis, they are removed.

Check if data points of some important variables make sense

In match_csv, the mean of first_blood_time is 93. Anyone who has played Dota will recognize that 93 minutes is far too long for a mean first blood time. A cursory internet search shows that in a sample tournament, first blood occurred at around 3 minutes. It is therefore safe to assume that first_blood_time is measured in seconds. Close to 25,000 observations have a first blood time below 100 seconds. This seems improbable since, on average, a player takes at least 1.5 minutes to buy some inventory and head to the center of the map. So although the unit of first_blood_time is seconds, there may be some errors in the way it is measured.
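The unit check above can be sketched in a few lines. This is an illustrative Python sketch, not the R code used for the report; it uses only the first ten first_blood_time values printed in the str() output of match_csv, so the numbers it produces describe that small sample and not the full 50,000 matches.

```python
from statistics import mean

# First ten first_blood_time values from the str() output of match_csv.
first_blood_time = [1, 221, 190, 40, 58, 113, 4, 255, 4, 85]

mean_seconds = mean(first_blood_time)
mean_minutes = mean_seconds / 60  # only meaningful if the unit is seconds

# Share of sampled matches with first blood before 100 seconds.
early_share = sum(t < 100 for t in first_blood_time) / len(first_blood_time)

print(f"mean: {mean_seconds:.1f} s ({mean_minutes:.2f} min)")
print(f"share under 100 s: {early_share:.0%}")
```

Read as seconds, the sample mean is a plausible one to two minutes; read as minutes, it would be longer than almost any match.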

Convert incorrect negative values to 0

In multiple columns, some observations are negative. For instance, some time and account ID columns have negative values. This is incorrect and must be corrected. Such incorrect negative values are set to 0.

Why not convert incorrect negative values to NA

The majority of incorrect negative values are seen in the time columns of different CSV files and the account_id column of the player_ratings_csv file. In the chat_csv file's time column, which is arranged in ascending order of time per match ID, the negative values are very small single-digit numbers. It is likely that these actions occurred at the very beginning of the game and, instead of being counted as time 0, were incorrectly recorded as small negative values. So, for the sake of simplicity, all negative time values are converted to 0. For account_id, we know that players can choose not to reveal their ID and are then counted as anonymous. These anonymous players are recorded as 0 in the account_id column. Again, for the sake of simplicity, negative account_id values are also considered anonymous and converted to 0.
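The replacement rule described above amounts to clamping negative values at 0. The following is a minimal Python sketch of that rule, not the report's R code; the helper name clamp_negative_to_zero is hypothetical, and the sample values are the account_id entries printed in the str() output of player_ratings_csv.

```python
def clamp_negative_to_zero(values):
    """Return a copy of values with every negative number replaced by 0."""
    return [0 if v < 0 else v for v in values]


# account_id values from the str() output of player_ratings_csv.
account_id = [236579, -343, -1217, -1227, -1284, 308663, 79749, -1985, -2160, 26500]

cleaned = clamp_negative_to_zero(account_id)
print(cleaned)
```

Note that this merges all negative IDs into the single anonymous ID 0, which is the simplification the text chooses over converting them to NA.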

Nullify columns with very high percentage of NA values

For the sake of simplicity, columns with a very high percentage of NA values are nullified, that is, removed from the data. These columns are considered not significant for the purpose of our analysis. This is done using the summarise_at() function of dplyr.
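As a rough illustration of this step, the sketch below computes the fraction of missing values per column and keeps only the columns under a cutoff. It is written in Python, not the dplyr code used for the report. The 0.9 cutoff and the helper names are assumptions, since the text does not state the threshold that was used; the sample values are the first five entries of three columns in the str() output of players_csv, with NA written as None.

```python
def na_fraction(column):
    """Fraction of entries in a column that are missing (None)."""
    return sum(v is None for v in column) / len(column)


def drop_sparse_columns(table, max_na=0.9):
    """Keep only columns whose missing fraction is at most max_na.

    max_na=0.9 is an assumed cutoff, not a value taken from the report.
    """
    return {name: col for name, col in table.items() if na_fraction(col) <= max_na}


# First five values of three players_csv columns; None stands for NA.
players = {
    "kills": [9, 13, 0, 8, 20],
    "xp_roshan": [None, 2683, None, 894, None],
    "gold_abandon": [None, None, None, None, None],
}

kept = drop_sparse_columns(players)
print(list(kept))
```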

Feature Engineering

Check if some small csv files can be combined with larger csv files to make it easier to visualize data

In match_csv, a variable called cluster is stored as a number. This variable identifies the geographical region where the match took place. The cluster_csv file holds key-value pairs, where the key is the numeric cluster variable and the value is the name of the geographical region. left_join() is used to merge the match_csv and cluster_csv files to help with data visualization. Several similar joins are performed to combine small CSV files with larger ones.
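The behaviour of a left join can be sketched as follows. This is a plain Python illustration, not dplyr's left_join(); it assumes the key is unique in the lookup table. The cluster numbers come from the str() output of match_csv, while the region names REGION_A and REGION_B are placeholders, since the contents of cluster_csv are not shown in this report.

```python
def left_join(left_rows, right_rows, key):
    """Left join two lists of dicts on key.

    Every left row is kept. Columns from the right table are filled
    with None when the key has no match. Assumes key is unique in
    right_rows.
    """
    lookup = {row[key]: row for row in right_rows}
    right_cols = [c for c in (right_rows[0] if right_rows else {}) if c != key]
    joined = []
    for row in left_rows:
        match = lookup.get(row[key])
        extra = {c: (match[c] if match else None) for c in right_cols}
        joined.append({**row, **extra})
    return joined


matches = [
    {"match_id": 0, "cluster": 155},
    {"match_id": 1, "cluster": 154},
    {"match_id": 2, "cluster": 132},
]
# Placeholder region names; the real mapping lives in cluster_csv.
clusters = [
    {"cluster": 155, "region": "REGION_A"},
    {"cluster": 154, "region": "REGION_B"},
]

joined = left_join(matches, clusters, "cluster")
for row in joined:
    print(row)
```

The third match keeps its row even though its cluster has no entry in the lookup table, which is what distinguishes a left join from an inner join.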

Classify hero_id column in players_csv to STR, AGI and INT

For ease of classification, analysis and predictive modelling, heroes are classified into STR (strength), AGI (agility) and INT (intelligence). The 113 heroes have already been classified by a former Springboard student, Louis Montague, who has posted this specific CSV file on his GitHub profile.

https://github.com/DMzMin/LMS_Dota2/blob/master/hero_names_mod.csv

Remove rows in players_csv and player_ratings_csv where account_id = 0

Rows with account_id = 0 are rows where the player has chosen to remain anonymous and not disclose their identity. Since account_id is a categorical variable, having too many rows with an ID of 0 tends to distort our analysis, especially when combining CSV files using account_id. So, while the players_csv file is kept as it is, a new players_csv1 is created after removing rows with account_id equal to 0.

Replace 0-4 and 128-132 in players_csv1 with Radiant and Dire

A cursory internet search shows that the player_slot column in players_csv1 encodes the two Dota teams: radiant and dire. Slots 0-4 are team radiant and slots 128-132 are team dire. These numbers are replaced with team names in the player_slot column of players_csv1.
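The slot-to-team mapping described above can be written as a small function. This is a Python sketch, not the report's R code; the function name team_from_slot is hypothetical, and the sample slots are the player_slot values printed for match 0 in the str() output of players_csv.

```python
def team_from_slot(player_slot):
    """Map a player_slot value to its team name (0-4 Radiant, 128-132 Dire)."""
    if 0 <= player_slot <= 4:
        return "Radiant"
    if 128 <= player_slot <= 132:
        return "Dire"
    raise ValueError(f"unexpected player_slot: {player_slot}")


# player_slot values for match 0 from the str() output of players_csv.
player_slot = [0, 1, 2, 3, 4, 128, 129, 130, 131, 132]

teams = [team_from_slot(s) for s in player_slot]
print(teams)
```

Raising an error for any other value is a design choice made here so that unexpected slot codes are noticed instead of silently mislabelled.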

Convert observations in columns item_0 to item_5 in players_csv3, from numbers to class of items to reduce number of factors

The item_0 to item_5 columns contain the IDs of the items bought by players for their heroes. Heroes have six slots on their dashboard to keep items, and 0 to 5 stands for the six slots. While the original dataset populates the item_0 to item_5 columns with item IDs, we classify items by item class instead. The items are converted to STR (strength), AGI (agility), INT (intelligence), MOB (mobility), MISC (miscellaneous) etc. Some items combine multiple classes. These items are simply classified as 2 Class (an item that combines 2 classes), 3 Class, and so on up to 5 Class. This type of classification reduces the number of factor levels to a manageable number.

item_class_1 is a newly created CSV file, which duplicates item_ids_csv and adds an item_class column. The item_class column is populated with the item classes mentioned earlier. This is done using information gathered from the internet.

Create player_ratings_csv2, which only includes account_ids that have played 5 games or more

Human players who have played fewer than 5 games of Dota are poor predictors of skill and gameplay fairness. player_ratings_csv2 is created so that it only contains players who have played 5 or more games.
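A minimal Python sketch of this filter is given below; it is not the report's R code. The threshold follows the heading ("5 games or more"), and the sample values are the account_id and total_matches entries printed in the str() output of player_ratings_csv.

```python
# Values from the str() output of player_ratings_csv.
account_id = [236579, -343, -1217, -1227, -1284, 308663, 79749, -1985, -2160, 26500]
total_matches = [24, 1, 1, 1, 1, 1, 40, 1, 12, 50]

MIN_MATCHES = 5  # "5 games or more", as stated in the heading

# Keep only the accounts that meet the minimum number of matches.
kept = [acc for acc, n in zip(account_id, total_matches) if n >= MIN_MATCHES]
print(kept)
```

In this small sample, six of the ten accounts have a single recorded match and are dropped.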

Exploratory Data Analysis

Hero Stats

players_csv has detailed information about the player, the hero they chose and their in-game performance. This information is grouped by hero class (STR, AGI or INT). Further, all stats are normalized so that, for each stat, the values for STR, AGI and INT heroes sum to one. This operation allows us to view all stats on a single heatmap. The heatmap is shown below.
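The normalization can be sketched as dividing each class total by the sum over the three classes. The Python snippet below illustrates this; it is not the report's R code, the function name is hypothetical, and the healing totals are made-up numbers chosen only to show the arithmetic.

```python
def normalize_across_classes(stat_by_class):
    """Scale one stat so that its values over all hero classes sum to one."""
    total = sum(stat_by_class.values())
    if total == 0:
        raise ValueError("cannot normalize a stat whose total is zero")
    return {cls: value / total for cls, value in stat_by_class.items()}


# Made-up totals for a single stat, for illustration only.
hero_healing = {"STR": 200.0, "AGI": 100.0, "INT": 500.0}

shares = normalize_across_classes(hero_healing)
print(shares)
```

Because every stat ends up on the same 0 to 1 scale, stats with very different magnitudes (for example gold and kills) can share one color scale in a heatmap.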

From the heatmap, it can be seen that, on an absolute basis, INT heroes heal the most, AGI heroes cause maximum tower damage, STR heroes cause highest stuns and AGI heroes have highest last hits.

Percentage of Wins

While total_wins is available for each account_id, simply looking at total_wins is inadequate because players differ in the number of matches they have played. It is more accurate to look at percentage_wins, the share of matches won. This is possible since we know both total_wins and total_matches.
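The metric is simply total_wins divided by total_matches. The Python sketch below (not the report's R code) computes it for the ten accounts printed in the str() output of player_ratings_csv.

```python
# Values from the str() output of player_ratings_csv.
total_wins = [14, 1, 1, 1, 0, 1, 21, 0, 8, 26]
total_matches = [24, 1, 1, 1, 1, 1, 40, 1, 12, 50]

# Share of matches won per account, between 0 and 1.
percentage_wins = [w / m for w, m in zip(total_wins, total_matches)]

for wins, played, pct in zip(total_wins, total_matches, percentage_wins):
    print(f"{wins:>2} / {played:>2} = {pct:.3f}")
```

Accounts with a single match can only score 0 or 1, which is one reason to restrict the analysis to players with a minimum number of matches.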

Using this metric, we plot a histogram of the percentage of wins. The most frequent value of percentage_wins is 0.5, showing that the largest group of players wins half of its games.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Moreover, an even larger proportion of players win between 40-60% of the games they play.

Correlation between trueskill_mu and percentage_wins

trueskill_mu is the mean skill estimate produced by Microsoft Research's TrueSkill algorithm. To check its accuracy, we correlate it with percentage_wins. The graph shows that a higher skill level goes with a higher win percentage. Since player skill is hard to measure, a standard deviation (trueskill_sigma) is included for each skill estimate. The colors of the dots in the plot below show the different standard deviations.

While there is a definite correlation between trueskill_mu and percentage_wins, trueskill_sigma does not have a definite pattern, except for the dark blue shades in the center of the plot and in the top right of the plot.

Boxplot of item types and hero levels at end of game

The plot below, drawn between item_0 and hero level, shows that the Dam (Damage) item type has the highest median level, followed by Disable. Int item types have the lowest median level.

Similar plots were made for item_1 to item_5 versus hero level. All plots had very similar final outputs.

Based on this, it can be inferred that Damage and Disable class items lead to the highest hero median level while Int item types lead to the lowest hero median level. The opposite of this statement could also be true. It is possible that Damage and Disable items are late game items that can only be bought by heroes that have substantially levelled up. The next few sections will try to infer if higher hero levels are caused by specific items or if heroes are able to buy these specific items only after reaching those higher levels.

Lineplot of the percentage of items acquired by each hero type for each item slot location

The line plot below shows the percentage of item classes acquired by each hero type. For instance, more than 35% of AGI heroes acquire 2 Class items in the item_0 inventory slot. The trend for the different item slots is similar, as seen from the shape of the different line graphs.

## Warning: Removed 8 rows containing missing values (geom_point).

From the line plot, it can be seen that 2 Class items are most preferred by all hero types. STR heroes have a high preference for Mob (mobility) items. This is obvious since STR heroes tend to be slow moving.

Does high median hero level lead to purchase of damage/disable items or does purchase of damage/disable items lead to high median hero level?

The above line graph was obtained after excluding item classes that constituted less than 5% of total observations per hero. Since this filter excluded damage and disable items, it follows that heroes bought standalone damage or disable items less than 5% of the time. If the strategy of buying solely damage or disable items were successful, it is likely that more players would have chosen to buy standalone damage or disable items. This graph shows that only very few players chose to buy these items. So, it is concluded that only heroes that have already attained high levels buy damage or disable items. Buying damage or disable items cannot be regarded as a reason for a high hero level.

Identify heroes that win most games

On an absolute basis, INT heroes win the maximum number of games, followed by STR and then AGI heroes. This is true for both team types.

## Warning: Ignoring unknown parameters: binwidth, bins, pad

INT heroes have won the most games, followed by STR heroes. This trend is the same for both radiant and dire teams.

Identify heroes that have highest probability of winning

STR heroes have the highest probability of winning, followed by AGI and then INT. The difference in winning probabilities between the three hero types is negligible.

Team hero combinations and winnability

Teams with a combination of >= 3 INT heroes win more games, followed by teams with a combination of >= 3 STR heroes. Looking only at the absolute number of games won by teams with multiple heroes of the same type is inadequate, since the higher win count for teams with >= 3 INT heroes may simply be due to many more teams choosing >= 3 INT heroes. So, the probability of a team winning when it has >= 3 INT, >= 3 STR or >= 3 AGI heroes is calculated.
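The difference between counting wins and computing a win probability can be sketched as follows. This Python snippet is an illustration, not the report's R code; the function names are hypothetical and the five teams are made-up examples. Since a team has five heroes, at most one class can reach three or more picks.

```python
from collections import Counter


def dominant_class(team_classes, threshold=3):
    """Return the hero class picked at least threshold times, or None."""
    cls, count = Counter(team_classes).most_common(1)[0]
    return cls if count >= threshold else None


def win_rate_by_dominant_class(teams):
    """Wins divided by games, per dominant class, over (classes, won) pairs."""
    wins, games = Counter(), Counter()
    for classes, won in teams:
        cls = dominant_class(classes)
        if cls is None:
            continue  # no class reaches the threshold in this team
        games[cls] += 1
        wins[cls] += int(won)
    return {cls: wins[cls] / games[cls] for cls in games}


# Made-up teams for illustration: (hero classes, did the team win?).
teams = [
    (["INT", "INT", "INT", "STR", "AGI"], True),
    (["INT", "INT", "INT", "INT", "AGI"], False),
    (["STR", "STR", "STR", "AGI", "INT"], True),
    (["STR", "STR", "AGI", "AGI", "INT"], True),
    (["AGI", "AGI", "AGI", "STR", "INT"], False),
]

rates = win_rate_by_dominant_class(teams)
print(rates)
```

In this toy data, INT-heavy teams appear most often but win only half of their games, which is the kind of gap between win counts and win probabilities that the paragraph above describes.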

Above plot shows that teams which pick >= 3 STR heroes have a higher probability of winning, followed by teams that pick >= 3 AGI heroes. This topic is further explored in the logistic regression area of the project.

Association rules between hero type and item purchase

Item Frequency Plot

The plot below shows that 2 Class items are the most commonly purchased items. These are followed by miscellaneous, mobility and health items. The fact that 2 Class items are bought most often matches the earlier line plot, which showed that all three hero types have a very high preference for 2 Class items.

General rule output

Association rules are most commonly used by the retail industry. They show the probability of a person buying item X, if they previously bought item Y in the same trip. For the capstone project, association rules are used to analyze the items purchased by STR, AGI and INT heroes.
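The three quantities reported for each rule (support, confidence and lift) can be computed directly from counts. The Python sketch below uses the standard definitions; it is not the Apriori implementation used for the report, the function name is hypothetical, and the four transactions are made-up examples written in the same item_0=..., Class=... style as the rule output.

```python
def rule_metrics(transactions, lhs, rhs):
    """Support, confidence and lift of the rule lhs => rhs.

    support    = P(lhs and rhs)
    confidence = P(rhs | lhs)
    lift       = confidence / P(rhs)
    """
    lhs, rhs = set(lhs), set(rhs)
    n = len(transactions)
    n_lhs = sum(lhs <= t for t in transactions)
    n_rhs = sum(rhs <= t for t in transactions)
    n_both = sum((lhs | rhs) <= t for t in transactions)
    if n_lhs == 0 or n_rhs == 0:
        raise ValueError("lhs and rhs must each occur at least once")
    support = n_both / n
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n)
    return support, confidence, lift


# Made-up transactions: one set of items per player.
transactions = [
    {"item_0=Mob", "Class=STR"},
    {"item_0=Mob", "Class=STR"},
    {"item_0=Mob", "Class=AGI"},
    {"item_0=Agi", "Class=AGI"},
]

support, confidence, lift = rule_metrics(transactions, {"item_0=Mob"}, {"Class=STR"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Because lift is confidence divided by the overall share of the right-hand side, a rule in the output that follows with confidence 0.96 and lift 3.2 implies that roughly 30% of transactions have Class=AGI.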

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 318 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[371 item(s), 318831 transaction(s)] done [0.44s].
## sorting and recoding items ... [196 item(s)] done [0.10s].
## creating transaction tree ... done [0.68s].
## checking subsets of size 1 2 3 4 5 6 done [2.08s].
## writing ... [206 rule(s)] done [0.02s].
## creating S4 object  ... done [0.10s].

##      lhs                 rhs         support confidence lift
## [1]  {item_0=2 Class,                                       
##       item_2=Agi,                                           
##       item_3=2 Class} => {Class=AGI}  0.0011       0.96  3.2
## [2]  {item_0=2 Class,                                       
##       item_1=Agi,                                           
##       item_2=2 Class} => {Class=AGI}  0.0012       0.96  3.1
## [3]  {item_0=2 Class,                                       
##       item_1=2 Class,                                       
##       item_2=Agi}     => {Class=AGI}  0.0014       0.94  3.1
## [4]  {item_0=2 Class,                                       
##       item_1=2 Class,                                       
##       item_4=Agi}     => {Class=AGI}  0.0016       0.94  3.1
## [5]  {item_0=2 Class,                                       
##       item_1=Agi,                                           
##       item_3=2 Class} => {Class=AGI}  0.0010       0.94  3.1
## [6]  {item_0=2 Class,                                       
##       item_1=3 Class,                                       
##       item_4=Agi}     => {Class=AGI}  0.0011       0.93  3.1
## [7]  {item_0=2 Class,                                       
##       item_1=2 Class,                                       
##       item_4=Life}    => {Class=AGI}  0.0015       0.93  3.1
## [8]  {item_0=2 Class,                                       
##       item_2=2 Class,                                       
##       item_4=Agi}     => {Class=AGI}  0.0014       0.93  3.1
## [9]  {item_0=2 Class,                                       
##       item_1=2 Class,                                       
##       item_3=Life}    => {Class=AGI}  0.0014       0.93  3.1
## [10] {item_0=Agi,                                           
##       item_1=2 Class,                                       
##       item_2=2 Class} => {Class=AGI}  0.0012       0.93  3.1

The rule output shows that the rules with the highest confidence all point to AGI heroes. INT and STR heroes do not even feature in the top 10 rules.

STR specific rule output

##      lhs                 rhs         support confidence lift
## [1]  {item_1=Mob,                                           
##       item_2=2 Class,                                       
##       item_3=4 Class} => {Class=STR}  0.0011       0.75  2.3
## [2]  {item_0=Mob,                                           
##       item_1=4 Class,                                       
##       item_5=2 Class} => {Class=STR}  0.0011       0.75  2.3
## [3]  {item_0=Mob,                                           
##       item_3=4 Class,                                       
##       item_5=2 Class} => {Class=STR}  0.0011       0.73  2.3
## [4]  {item_0=Mob,                                           
##       item_2=2 Class,                                       
##       item_3=4 Class} => {Class=STR}  0.0017       0.73  2.3
## [5]  {item_0=Mob,                                           
##       item_2=4 Class,                                       
##       item_5=2 Class} => {Class=STR}  0.0011       0.72  2.2
## [6]  {item_0=Mob,                                           
##       item_3=4 Class,                                       
##       item_4=2 Class} => {Class=STR}  0.0012       0.72  2.2
## [7]  {item_2=2 Class,                                       
##       item_4=Armour}  => {Class=STR}  0.0013       0.72  2.2
## [8]  {item_0=Mob,                                           
##       item_1=4 Class,                                       
##       item_2=2 Class} => {Class=STR}  0.0017       0.72  2.2
## [9]  {item_0=Mob,                                           
##       item_2=2 Class,                                       
##       item_5=4 Class} => {Class=STR}  0.0012       0.72  2.2
## [10] {item_3=2 Class,                                       
##       item_5=Armour}  => {Class=STR}  0.0013       0.72  2.2

The highest confidence among the STR hero rules is only 0.75.

Machine Learning

Linear regression, logistic regression and a decision tree are used in the machine learning part of the capstone project. Linear regression is used to predict numeric values of trueskill_mu. Logistic regression is used to predict the probability of the radiant or dire team winning. Finally, a decision tree (CART) is used to perform the same prediction as the logistic regression.

Linear Regression (LR)

Correlated Variables

Before performing the linear regression, a correlation plot was generated.

First LR model

trueskill_mu was predicted after removing all correlated variables. Following summary output was obtained.

## 
## Call:
## lm(formula = trueskill_mu ~ trueskill_sigma + gold + kills + 
##     deaths + assists + denies + last_hits + hero_healing + tower_damage + 
##     level + xp_other + gold_destroying_structure + player_slot_Radiant + 
##     player_slot_Dire + Class_STR + Class_AGI + Class_INT, data = Combined_LR_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.647  -2.643  -0.024   2.603  20.611 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2.81e+01   1.07e-01  263.78  < 2e-16 ***
## trueskill_sigma           -3.22e-01   1.10e-02  -29.43  < 2e-16 ***
## gold                      -2.02e-05   1.13e-05   -1.78    0.074 .  
## kills                      1.00e-02   4.22e-03    2.37    0.018 *  
## deaths                    -2.68e-02   4.97e-03   -5.40  6.8e-08 ***
## assists                    1.34e-02   3.36e-03    3.99  6.5e-05 ***
## denies                     3.99e-03   2.88e-03    1.39    0.165    
## last_hits                  1.34e-03   2.76e-04    4.85  1.2e-06 ***
## hero_healing               6.17e-05   1.35e-05    4.58  4.6e-06 ***
## tower_damage               1.33e-05   1.49e-05    0.89    0.373    
## level                     -4.72e-02   7.75e-03   -6.09  1.1e-09 ***
## xp_other                   3.11e-05   1.45e-05    2.16    0.031 *  
## gold_destroying_structure -9.28e-06   1.55e-05   -0.60    0.551    
## player_slot_Radiant        6.65e-01   8.15e-01    0.82    0.414    
## player_slot_Dire           6.62e-01   8.15e-01    0.81    0.416    
## Class_STR                 -6.43e-01   8.15e-01   -0.79    0.430    
## Class_AGI                 -6.69e-01   8.15e-01   -0.82    0.412    
## Class_INT                 -6.63e-01   8.15e-01   -0.81    0.416    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4 on 87330 degrees of freedom
## Multiple R-squared:  0.0142, Adjusted R-squared:  0.0141 
## F-statistic: 74.2 on 17 and 87330 DF,  p-value: <2e-16

The output shows a very low R^2 value (about 0.014) for the model, meaning it explains under 2% of the variance in trueskill_mu. Also, the player's team and hero class are not statistically significant.

Second LR model

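Variables from the first LR model were then pruned, including the team and hero class indicators, none of which were statistically significant, and the second model was built. As the model call below shows, kills and level were dropped as well, while gold was retained. The following output was obtained.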

## 
## Call:
## lm(formula = trueskill_mu ~ trueskill_sigma + gold + deaths + 
##     assists + last_hits + hero_healing + xp_other, data = Combined_LR_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.652  -2.643  -0.035   2.605  20.596 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.78e+01   7.33e-02  378.63  < 2e-16 ***
## trueskill_sigma -3.33e-01   1.01e-02  -32.99  < 2e-16 ***
## gold            -2.83e-05   1.04e-05   -2.71   0.0067 ** 
## deaths          -3.32e-02   4.50e-03   -7.38  1.5e-13 ***
## assists          2.67e-03   2.70e-03    0.99   0.3233    
## last_hits        4.12e-04   1.64e-04    2.52   0.0118 *  
## hero_healing     5.73e-05   1.33e-05    4.30  1.7e-05 ***
## xp_other         1.80e-05   1.42e-05    1.27   0.2058    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4 on 87340 degrees of freedom
## Multiple R-squared:  0.0137, Adjusted R-squared:  0.0136 
## F-statistic:  174 on 7 and 87340 DF,  p-value: <2e-16

Five of the seven variables in this model are significant at the 5% level; assists and xp_other are not. R^2 remains very low at 0.0137.

The Residuals versus Fitted plot shows some correlation between the two variables, indicated by the parabolic shape of the curve. The Normal Q-Q plot shows that the residuals are approximately normally distributed, since the points follow a straight line. The square root of standardized residuals versus fitted plot also shows some correlation between the two variables.

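It can be concluded that the linear regression is neither a great nor a bad model. It is a mediocre model with some flaws: it explains only about 1.4% of the variance in trueskill_mu, and the residual plots show some remaining structure.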
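As a rough check on how little variance the second LR model explains, the adjusted R^2 and F-statistic can be recomputed from the reported R^2 and degrees of freedom. The sketch below is in Python rather than R and uses only the rounded figures printed in the summary, so it reproduces the output approximately; the function names are illustrative and not part of the original analysis.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for a linear model with p predictors (intercept excluded) and n rows."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def f_statistic(r2: float, n: int, p: int) -> float:
    """Overall F-statistic implied by R^2, with p and n - p - 1 degrees of freedom."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


# Figures taken from the second LR summary: R^2 = 0.0137, 7 predictors,
# 87340 residual degrees of freedom (as printed, i.e. possibly rounded).
p = 7
n = 87340 + p + 1  # residual df = n - p - 1
r2 = 0.0137

adj = adjusted_r2(r2, n, p)
f = f_statistic(r2, n, p)
print(f"adjusted R^2 = {adj:.4f}, F = {f:.1f}")
```

With this many rows the adjustment barely changes R^2, and the large F-statistic reflects the sample size rather than a strong fit.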

Logistic Regression (LogR)

Logistic regression is used to predict the probability of the radiant or dire team winning. To do so, the data is grouped by match_id and player_slot, and every column is summarised by summing its observations within each group.
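The grouping step can be sketched as follows. This is a minimal Python illustration, not the code used in the report: the rows are made-up values, and the helper name sum_by_match_and_team is hypothetical. It assumes player_slot has already been recoded to Radiant or Dire.

```python
from collections import defaultdict


def sum_by_match_and_team(rows, keys=("match_id", "player_slot")):
    """Group player rows by the key columns and sum every other column."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        group = tuple(row[k] for k in keys)
        for column, value in row.items():
            if column not in keys:
                totals[group][column] += value
    return {group: dict(columns) for group, columns in totals.items()}


# Hypothetical player-level rows (values invented for illustration only).
rows = [
    {"match_id": 0, "player_slot": "Radiant", "gold": 1200, "kills": 3},
    {"match_id": 0, "player_slot": "Radiant", "gold": 800, "kills": 1},
    {"match_id": 0, "player_slot": "Dire", "gold": 500, "kills": 2},
]

team_totals = sum_by_match_and_team(rows)
for group, columns in team_totals.items():
    print(group, columns)
```

Each match then contributes one row per team, which is the unit the logistic regression works on.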

Correlated Variables

Before performing logistic regression, a correlation plot was generated.

First LogR model

radiant_win was predicted after removing all correlated variables. The following output was obtained.

## 
## Call:
## glm(formula = radiant_win ~ gold + kills + deaths + denies + 
##     last_hits + stuns + hero_healing + tower_damage + xp_other + 
##     gold_other + trueskill_mu + trueskill_var + Class_STR + Class_AGI + 
##     Class_INT + duration + first_blood_time + cluster, family = binomial, 
##     data = LogR_Train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##  -1.55   -1.21    1.02    1.13    1.73  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       6.33e-01   8.06e-02    7.86  3.9e-15 ***
## gold             -5.19e-06   2.55e-06   -2.04  0.04179 *  
## kills             6.47e-03   1.08e-03    5.99  2.1e-09 ***
## deaths           -6.23e-03   1.18e-03   -5.29  1.2e-07 ***
## denies           -1.62e-03   8.45e-04   -1.92  0.05480 .  
## last_hits         1.68e-04   6.93e-05    2.42  0.01559 *  
## stuns            -1.71e-04   1.47e-04   -1.16  0.24677    
## hero_healing     -1.37e-05   3.94e-06   -3.49  0.00049 ***
## tower_damage     -2.84e-05   3.54e-06   -8.02  1.1e-15 ***
## xp_other         -2.79e-06   3.77e-06   -0.74  0.45855    
## gold_other       -1.26e-07   3.98e-06   -0.03  0.97484    
## trueskill_mu     -2.20e-04   1.18e-03   -0.19  0.85148    
## trueskill_var     2.99e-06   1.83e-04    0.02  0.98692    
## Class_STR         3.79e-02   3.71e-02    1.02  0.30720    
## Class_AGI         3.80e-02   3.75e-02    1.01  0.31189    
## Class_INT         4.05e-02   3.65e-02    1.11  0.26763    
## duration         -2.44e-04   2.36e-05  -10.37  < 2e-16 ***
## first_blood_time  7.13e-05   8.74e-05    0.82  0.41464    
## cluster           5.82e-04   3.31e-04    1.76  0.07843 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 88460  on 63875  degrees of freedom
## Residual deviance: 87925  on 63857  degrees of freedom
## AIC: 87963
## 
## Number of Fisher Scoring iterations: 4

The output again shows that hero choice is not statistically significant. Even player skill (trueskill_mu, trueskill_var) is not significant in determining the probability of the radiant or dire team winning according to the logistic regression.

Second LogR model

All variables that were not statistically significant at the 5% level in the first logistic regression model were removed and the second model was built.

## 
## Call:
## glm(formula = radiant_win ~ gold + kills + deaths + last_hits + 
##     hero_healing + tower_damage + duration, family = binomial, 
##     data = LogR_Train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##  -1.53   -1.21    1.02    1.13    1.75  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   8.04e-01   3.54e-02   22.67  < 2e-16 ***
## gold         -4.36e-06   2.48e-06   -1.76  0.07879 .  
## kills         6.49e-03   1.02e-03    6.35  2.2e-10 ***
## deaths       -5.14e-03   8.63e-04   -5.96  2.6e-09 ***
## last_hits     1.86e-04   5.53e-05    3.36  0.00078 ***
## hero_healing -1.22e-05   3.77e-06   -3.22  0.00127 ** 
## tower_damage -2.87e-05   3.51e-06   -8.17  3.0e-16 ***
## duration     -2.70e-04   1.52e-05  -17.74  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 88460  on 63875  degrees of freedom
## Residual deviance: 87936  on 63868  degrees of freedom
## AIC: 87952
## 
## Number of Fisher Scoring iterations: 4

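The residual deviance (87936) is only slightly below the null deviance (88460), so the predictors explain very little of the variation in match outcome. AIC (87952) is marginally lower than for the first model (87963), but an AIC value has no meaning on its own; it only ranks models fitted to the same data. This model is therefore also not very useful for drawing larger conclusions.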
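The deviance and AIC figures in the two summaries above can be related with a little arithmetic. The sketch below is in Python and assumes an ungrouped binary response, for which AIC equals the residual deviance plus twice the number of estimated coefficients; the fraction of deviance explained is one common pseudo-R^2 and was not reported in the original analysis.

```python
def aic_from_deviance(residual_deviance: float, n_coefficients: int) -> float:
    """AIC for a binary-response GLM: residual deviance + 2 * number of coefficients."""
    return residual_deviance + 2 * n_coefficients


def deviance_explained(null_deviance: float, residual_deviance: float) -> float:
    """Share of the null deviance removed by the predictors (a pseudo-R^2)."""
    return 1.0 - residual_deviance / null_deviance


# Values copied from the first and second LogR summaries; k counts the intercept.
models = {
    "first": {"null": 88460, "residual": 87925, "k": 19},
    "second": {"null": 88460, "residual": 87936, "k": 8},
}

for name, m in models.items():
    aic = aic_from_deviance(m["residual"], m["k"])
    share = deviance_explained(m["null"], m["residual"])
    print(f"{name}: AIC = {aic}, deviance explained = {share:.4f}")
```

Both models remove well under 1% of the null deviance, which is a more direct sign of a weak fit than the size of the AIC itself.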

Accuracy of second LogR model

##    0    1 
## 0.51 0.52

##    
##     FALSE  TRUE
##   0 10368 20366
##   1  9302 23840

Model accuracy on the training set is about 53.6% ((10368 + 23840) / 63876).
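To put this accuracy in context, it helps to compare it with the share of the majority class. The Python sketch below recomputes accuracy from the confusion matrix above, assuming rows are the actual radiant_win values and columns are the thresholded predictions; the baseline of always predicting the more frequent outcome is an addition for illustration.

```python
# Confusion matrix from the second LogR model: keys are (actual, predicted).
confusion = {
    (0, False): 10368,
    (0, True): 20366,
    (1, False): 9302,
    (1, True): 23840,
}

total = sum(confusion.values())

# Correct predictions: actual 0 predicted FALSE, actual 1 predicted TRUE.
correct = confusion[(0, False)] + confusion[(1, True)]
accuracy = correct / total

# Baseline: always predict the more frequent actual class.
actual_0 = confusion[(0, False)] + confusion[(0, True)]
actual_1 = confusion[(1, False)] + confusion[(1, True)]
baseline = max(actual_0, actual_1) / total

print(f"accuracy = {accuracy:.4f}, majority-class baseline = {baseline:.4f}")
```

The model beats the majority-class baseline by less than two percentage points.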

Third LogR model

A third logistic regression model uses radiant_win as the dependent variable and six binary independent variables: >= 3 STR (radiant), >= 3 STR (dire), >= 3 AGI (radiant), >= 3 AGI (dire), >= 3 INT (radiant) and >= 3 INT (dire). The dataset covers all 50000 matches, with radiant_win equal to 1 when radiant wins and 0 when dire wins. If the radiant team used >= 3 STR heroes, the n_R_STR variable is assigned 1, else 0. The other five variables are defined in the same way.
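The construction of the six indicator variables can be sketched as below. This is a Python illustration under the assumption that each team is available as a list of five hero types; the helper names and the example line-ups are hypothetical, and only the variable names (n_R_STR and so on) come from the report.

```python
from collections import Counter

HERO_TYPES = ("STR", "AGI", "INT")


def team_type_flags(hero_types, prefix, threshold=3):
    """Return 0/1 flags marking whether a team has >= threshold heroes of each type."""
    counts = Counter(hero_types)
    return {f"n_{prefix}_{t}": int(counts[t] >= threshold) for t in HERO_TYPES}


def match_features(radiant_types, dire_types):
    """Combine the radiant (R) and dire (D) flags into one row of predictors."""
    return {**team_type_flags(radiant_types, "R"), **team_type_flags(dire_types, "D")}


# Hypothetical match: radiant picks three STR heroes, dire has no type with three.
example = match_features(
    ["STR", "STR", "STR", "AGI", "INT"],
    ["AGI", "AGI", "INT", "INT", "STR"],
)
print(example)
```

Because a team has five heroes, at most one of its three flags can be 1, and all three are 0 for a 2-2-1 split.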

## 
## Call:
## glm(formula = radiant_win ~ n_R_STR + n_R_AGI + n_R_INT + n_D_STR + 
##     n_D_AGI + n_D_INT, family = binomial, data = Combined_LogR_Greaterequal3_4)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
##  -1.32   -1.21    1.04    1.15    1.24  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.0657     0.0159    4.14  3.5e-05 ***
## n_R_STR       0.0958     0.0253    3.79  0.00015 ***
## n_R_AGI      -0.0724     0.0278   -2.61  0.00918 ** 
## n_R_INT      -0.1334     0.0232   -5.74  9.5e-09 ***
## n_D_STR      -0.0801     0.0254   -3.15  0.00166 ** 
## n_D_AGI       0.0574     0.0282    2.04  0.04147 *  
## n_D_INT       0.1707     0.0231    7.40  1.3e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 69244  on 49999  degrees of freedom
## Residual deviance: 69082  on 49993  degrees of freedom
## AIC: 69096
## 
## Number of Fisher Scoring iterations: 3

Logistic regression shows that all 6 variables are statistically significant. The positive coefficient of n_R_STR shows that a radiant team with >= 3 STR heroes increases its chances of winning. n_D_STR has a negative coefficient, implying that when the dire team chooses >= 3 STR heroes, the chances of the radiant team winning decrease.
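The size of these effects is easier to judge as odds ratios and predicted probabilities. The Python sketch below exponentiates the coefficients printed in the summary above and evaluates the logistic function for a match described by its flags; the function names are illustrative, and the results inherit the rounding of the printed coefficients.

```python
import math

# Coefficients copied from the third LogR summary.
coef = {
    "(Intercept)": 0.0657,
    "n_R_STR": 0.0958,
    "n_R_AGI": -0.0724,
    "n_R_INT": -0.1334,
    "n_D_STR": -0.0801,
    "n_D_AGI": 0.0574,
    "n_D_INT": 0.1707,
}


def odds_ratio(name: str) -> float:
    """Multiplicative change in the odds of a radiant win when the flag is 1."""
    return math.exp(coef[name])


def radiant_win_probability(flags: dict) -> float:
    """Predicted probability of a radiant win for the given 0/1 flags."""
    z = coef["(Intercept)"] + sum(coef[name] * value for name, value in flags.items())
    return 1.0 / (1.0 + math.exp(-z))


baseline = radiant_win_probability({})              # no team has >= 3 of any type
with_str = radiant_win_probability({"n_R_STR": 1})  # radiant has >= 3 STR heroes

print(f"odds ratio for n_R_STR: {odds_ratio('n_R_STR'):.3f}")
print(f"P(radiant wins): baseline {baseline:.3f}, with >= 3 STR {with_str:.3f}")
```

Under these coefficients the radiant win probability moves from roughly 52% to roughly 54% when radiant fields >= 3 STR heroes, so the effect is statistically significant but modest.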

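It may be noted that the third logistic regression model has a lower AIC (69096) than the first (87963) and second (87952) logistic regression models. However, AIC can only rank models fitted to the same observations, and the third model was fitted to 50,000 matches while the first two were fitted to 63,876 training rows, so the values are not directly comparable. Within the third model, the drop from null deviance (69244) to residual deviance (69082) is small, so the variables relating to >= 3 hero types are statistically significant but explain little of the variation in match outcome.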

Decision Tree

The decision tree model is used to predict the probability of the radiant (1) or dire (0) team winning. The distinctive branches can be interpreted as follows: if a dire team has gold >= 11000 and the game duration did not exceed 2054, the dire team is more likely to win the game. The same rule applies to a radiant team.

If a radiant team had tower damage < 2082, did not have gold >= 11000, and the game duration was not >= 2054, then the dire team is more likely to win that game.

All branches to the left of a splitter variable are YES and those to the right are NO. This matches the YES and NO shown on the duration splitter at the very top of the decision tree.

Cross Validate decision tree

## CART 
## 
## 63876 samples
##    21 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 57489, 57488, 57489, 57489, 57489, 57488, ... 
## Resampling results across tuning parameters:
## 
##   cp    Accuracy  Kappa
##   0.01  0.87      0.75 
##   0.02  0.83      0.65 
##   0.03  0.81      0.62 
##   0.04  0.75      0.49 
##   0.05  0.74      0.47 
##   0.06  0.73      0.45 
##   0.07  0.73      0.45 
##   0.08  0.73      0.45 
##   0.09  0.61      0.19 
##   0.10  0.52      0.00 
##   0.11  0.52      0.00 
##   0.12  0.52      0.00 
##   0.13  0.52      0.00 
##   0.14  0.52      0.00 
##   0.15  0.52      0.00 
##   0.16  0.52      0.00 
##   0.17  0.52      0.00 
##   0.18  0.52      0.00 
##   0.19  0.52      0.00 
##   0.20  0.52      0.00 
##   0.21  0.52      0.00 
##   0.22  0.52      0.00 
##   0.23  0.52      0.00 
##   0.24  0.52      0.00 
##   0.25  0.52      0.00 
##   0.26  0.52      0.00 
##   0.27  0.52      0.00 
##   0.28  0.52      0.00 
##   0.29  0.52      0.00 
##   0.30  0.52      0.00 
##   0.31  0.52      0.00 
##   0.32  0.52      0.00 
##   0.33  0.52      0.00 
##   0.34  0.52      0.00 
##   0.35  0.52      0.00 
##   0.36  0.52      0.00 
##   0.37  0.52      0.00 
##   0.38  0.52      0.00 
##   0.39  0.52      0.00 
##   0.40  0.52      0.00 
##   0.41  0.52      0.00 
##   0.42  0.52      0.00 
##   0.43  0.52      0.00 
##   0.44  0.52      0.00 
##   0.45  0.52      0.00 
##   0.46  0.52      0.00 
##   0.47  0.52      0.00 
##   0.48  0.52      0.00 
##   0.49  0.52      0.00 
##   0.50  0.52      0.00 
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.01.

##    Combined_Tree_Model_CP_Predict
##         0     1
##   0 11728  1444
##   1  1896 12308

The final decision tree model gives an accuracy of about 87.8% ((11728 + 12308) / 27376), substantially higher than the logistic regression.
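The Kappa column in the cross-validation table measures agreement beyond chance, and the same statistic can be computed for the final confusion matrix. The Python sketch below assumes rows are actual outcomes and columns are predicted outcomes; it is a hand calculation of Cohen's kappa, not output from the original analysis.

```python
# Confusion matrix of the final tree: matrix[actual][predicted].
matrix = [
    [11728, 1444],
    [1896, 12308],
]

total = sum(sum(row) for row in matrix)

# Observed agreement: share of cases on the diagonal.
observed = (matrix[0][0] + matrix[1][1]) / total

# Expected agreement under independence of actual and predicted labels.
row_totals = [sum(row) for row in matrix]
col_totals = [matrix[0][j] + matrix[1][j] for j in range(2)]
expected = sum(row_totals[i] * col_totals[i] for i in range(2)) / total**2

kappa = (observed - expected) / (1.0 - expected)
print(f"accuracy = {observed:.4f}, kappa = {kappa:.4f}")
```

The kappa obtained this way is close to the cross-validated value of 0.75 reported for cp = 0.01, which suggests the tree's performance is not an artifact of class imbalance.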

Conclusions

Results from the capstone project show that

The act of hero selection by an individual player makes no measurable difference to gameplay and winnability. This is supported by the fact that the hero type variables (Class_STR, Class_AGI, Class_INT) are not statistically significant in the first linear and logistic regression models.
The act of team selection alone shows no significant effect. This is inferred from the fact that the radiant and dire variables (player_slot_Radiant, player_slot_Dire) are not statistically significant in the first linear regression model, which predicts trueskill_mu. These variables do not appear in the logistic regression models, so their effect on the probability of winning was not tested directly.
The collective act of team selection is statistically significant when >= 3 heroes of the same type are chosen. STR is the only type for which choosing >= 3 heroes raises a team's own chances of winning; choosing >= 3 AGI or INT heroes lowers them, with INT showing the largest coefficients.
The trueskill_sigma variable is not used as a splitter variable in the decision tree. Similarly, in the scatter plot of trueskill_mu against percentage_wins with trueskill_sigma as the color variable, trueskill_sigma has some prominence in the lower left and upper right corners of the plot, but overall it shows no perceivable shape.
There is a very high probability of AGI heroes buying 2 Class items. In fact, the rule association for AGI heroes is higher than for both STR and INT heroes. Similarly, there is a very high probability of STR heroes buying Mob (mobility) items. This makes sense since STR heroes are slow moving.

Some recommendations

Given that duration is a significant variable in the logistic regression and is used as a splitter variable in the decision tree, it is important to ensure that game duration has no impact on game outcome. Logistic regression shows that duration has a negative coefficient. So, longer games decrease the radiant team's probability of winning, giving dire teams an advantage. Instituting a bonus for radiant teams in longer games can help overcome this bias.
trueskill_sigma is not a splitter variable in the decision tree. trueskill_sigma does not have an impact on the scatter plot between trueskill_mu and percentage_wins. trueskill_sigma is only significant in the linear regression where trueskill_mu is the response variable. The trueskill algorithm requires review in the context of Dota 2.
Teams with >= 3 STR heroes seem to be at a statistically significant advantage. A second look at this aspect might be useful for building a fairer Dota 2.

Springboard Foundations of Data Science - Capstone Final Report

Arun Bharadwaj

September 7, 2017
