The dataset used in this study was posted to Kaggle by user Devin (www.kaggle.com/devinanzelmo/dota-2-matches) and spreads a wide range of variables across several CSV files. Devin obtained the data from www.opendota.com, a data mining website that collects match data from the Dota 2 game servers. This study will mainly focus on two of these files:
match.csv - contains 50,000 observations over 13 variables
players.csv - contains 500,000 observations over 73 variables
Other useful files that will support this work include:
hero_names.csv - originally contained 112 observations over 3 variables. More variables will be added to this file based upon the roles and classes defined at dota2.gamepedia.com/Role.
## Observations: 50,000
## Variables: 13
## $ match_id <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
## $ start_time <int> 1446750112, 1446753078, 1446764586, 14...
## $ duration <int> 2375, 2582, 2716, 3085, 1887, 1574, 21...
## $ tower_status_radiant <int> 1982, 0, 256, 4, 2047, 2047, 1972, 204...
## $ tower_status_dire <int> 4, 1846, 1972, 1924, 0, 4, 0, 0, 1982,...
## $ barracks_status_dire <int> 3, 63, 63, 51, 0, 3, 3, 0, 63, 63, 51,...
## $ barracks_status_radiant <int> 63, 0, 48, 3, 63, 63, 63, 63, 0, 0, 63...
## $ first_blood_time <int> 1, 221, 190, 40, 58, 113, 4, 255, 4, 8...
## $ game_mode <int> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22...
## $ radiant_win <fctr> True, False, False, False, True, True...
## $ negative_votes <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ positive_votes <int> 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ cluster <int> 155, 154, 132, 191, 156, 155, 151, 138...
| Variable | Description |
|---|---|
| match_id | unique integer to identify matches across all files |
| start_time | match start time in seconds since 00:00, 01/01/1970, UTC |
| duration | length of match in seconds |
| tower_status_radiant | total health of all Radiant towers at match end |
| tower_status_dire | total health of all Dire towers at match end |
| barracks_status_dire | total health of all Dire barracks at match end |
| barracks_status_radiant | total health of all Radiant barracks at match end |
| first_blood_time | time in seconds from match start until the first player is killed |
| game_mode | an integer indicating which game mode the match was played in |
| radiant_win | a factor indicating if the radiant team won or lost the match |
| negative_votes | number of negative votes a match received |
| positive_votes | number of positive votes a match received |
| cluster | integer indicating which region of the world the match was played |
Let’s look at the histograms of some of these variables. The variables ‘match_id’, ‘game_mode’, ‘radiant_win’, ‘negative_votes’, ‘positive_votes’ and ‘cluster’ will not be examined by histogram.
The histograms of the remaining variables show some interesting features.
‘start_time’ records when each match started, in seconds since 00:00, 01/01/1970, UTC. The bulk of the recorded matches fall in the second half of the ‘start_time’ range. When asked why this might be, Devin replied that he believes it is due to the way Opendota.com sampled the Dota 2 game servers (please see www.kaggle.com/louisms/discussion): the change in frequency marks the moment when Opendota.com improved its sampling to capture more games. This discrepancy does not affect how any individual game was played, so no matches will be excluded on this basis.
‘duration’ appears roughly normally distributed between about 1000 and 5000 seconds. However, some matches run beyond 16000 seconds; these are most likely outliers representing games that were not played at a competitive level. The shortest game lasted only 59 seconds, which is clearly not a competitive game either. Let’s take a closer look at its distribution.
It is important for this study to keep only the matches that were most likely played competitively, and duration is a key indicator of this. Discussions of average game length on Dota 2 forums generally agree that matches last 35-45 minutes on average (2100 - 2700 seconds), with matches of 25 minutes (1500 seconds) considered ‘short’ and matches of 60 minutes (3600 seconds) considered ‘long’ (link). For this study we will only look at matches that last between 15 and 75 minutes (900 - 4500 seconds).
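The duration filter and the ‘date’ variable that appears in the next summary could be produced along the following lines. This is a minimal sketch in base R (the report's own code uses dplyr), run on a toy data frame: the name `MatchRaw`, the inclusive bounds and the last two ‘start_time’ values are assumptions made for illustration, while `MatchNewDur` is the name used later in this report.

```r
# Toy stand-in for the raw 'match' data (hypothetical name and partly invented values)
MatchRaw <- data.frame(
  match_id   = 0:4,
  start_time = c(1446750112, 1446753078, 1446764586, 1446765723, 1446796385),
  duration   = c(2375, 59, 2716, 16037, 900)
)

# Keep matches lasting 15-75 minutes; treating both bounds as inclusive is an assumption
MatchNewDur <- subset(MatchRaw, duration >= 900 & duration <= 4500)

# Convert the Unix timestamp (seconds since 1970-01-01 UTC) into a date-time
MatchNewDur$date <- as.POSIXct(MatchNewDur$start_time,
                               origin = "1970-01-01", tz = "UTC")

# Number of matches removed by the filter
removed <- nrow(MatchRaw) - nrow(MatchNewDur)
```

On the toy data the 59-second and 16037-second matches are dropped, and the first ‘start_time’ converts to the same date-time shown in the summary that follows.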
## Observations: 49,678
## Variables: 14
## $ match_id <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
## $ start_time <int> 1446750112, 1446753078, 1446764586, 14...
## $ duration <int> 2375, 2582, 2716, 3085, 1887, 1574, 21...
## $ tower_status_radiant <int> 1982, 0, 256, 4, 2047, 2047, 1972, 204...
## $ tower_status_dire <int> 4, 1846, 1972, 1924, 0, 4, 0, 0, 1982,...
## $ barracks_status_dire <int> 3, 63, 63, 51, 0, 3, 3, 0, 63, 63, 51,...
## $ barracks_status_radiant <int> 63, 0, 48, 3, 63, 63, 63, 63, 0, 0, 63...
## $ first_blood_time <int> 1, 221, 190, 40, 58, 113, 4, 255, 4, 8...
## $ game_mode <int> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22...
## $ radiant_win <fctr> True, False, False, False, True, True...
## $ negative_votes <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ positive_votes <int> 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ cluster <int> 155, 154, 132, 191, 156, 155, 151, 138...
## $ date <dttm> 2015-11-05 19:01:52, 2015-11-05 19:51...
This removed 322 matches, leaving 49678 matches in the study with the following distribution of ‘duration’.
The variables recording each side’s tower status and barracks status at match end have similar distributions. Their differences (‘Tower_Diff’ and ‘Barracks_Diff’) show bimodal distributions, which is expected: when one team wins, its towers and barracks are expected to have much more health than the enemy’s. There is no reason to modify the dataset based upon these variables. Furthermore, these variables may help to assign each match a win status such as ‘strong_win’, ‘strong_loss’ or ‘close_game’.
Plotting the difference in tower health between the teams against duration gives this interesting plot.
The x-axis represents match duration in seconds and the y-axis represents the difference between the two sides’ total tower health when each match ends, with positive values meaning the Radiant side’s towers have more health and negative values meaning the Dire side’s towers have more health. The colour shows the match result from the perspective of the Radiant side.
An interesting observation from this initial plot is that the team whose towers have more health does not necessarily win the match, which is why both wins and losses are found for each team within the +500 to -500 TowerStat_R range. This can be used to group each match into either a dominant win or a close win for either side.
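One way such a grouping might look is sketched below in base R. It follows this report's reading of the tower status variables as health totals and uses the ±500 band mentioned above as the cut-off. The `MatchToy` data frame, the `win_status` column name and the choice to label from the Radiant perspective are assumptions for illustration; the last row is invented to show a close game.

```r
# Toy data: the first five rows reuse values from the summary above, the last is invented
MatchToy <- data.frame(
  tower_status_radiant = c(1982, 0, 256, 4, 2047, 1972),
  tower_status_dire    = c(4, 1846, 1972, 1924, 0, 1846)
)

# Positive values favour Radiant, negative values favour Dire
MatchToy$Tower_Diff <- MatchToy$tower_status_radiant - MatchToy$tower_status_dire

# Label each match from the Radiant perspective (threshold of 500 is an assumption)
MatchToy$win_status <- ifelse(abs(MatchToy$Tower_Diff) <= 500, "close_game",
                       ifelse(MatchToy$Tower_Diff > 0, "strong_win", "strong_loss"))
```

The same approach would apply to ‘Barracks_Diff’, with a threshold chosen from its own distribution.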
The variable ‘first_blood_time’ displays a right-skewed distribution, which is expected since it would be rare for a game to go on for long without a single player dying. This variable may be able to help define the win status, as we might expect earlier first blood kills in strong win/loss scenarios and later first blood kills in closer games.
The following important variables were not part of the histogram review above.
The ‘match_id’ variable is an integer that Devin replaced with his own numbering from 0 to 49,999; the same values appear in the ‘players’ data frame and are consistent across all files. As discussed with Devin (see ‘A quick look at Dota 2 dataset’ at https://www.kaggle.com/louisms/discussion), he did this to save space in his files, since the original ‘match_id’ values from the Dota 2 servers are much longer. This variable will be key to linking the ‘match’ and ‘players’ data frames together.
The variable ‘game_mode’ was tallied as follows.
## # A tibble: 2 × 2
## game_mode n
## <int> <int>
## 1 2 1316
## 2 22 48362
This shows that 1316 matches were played as game mode ‘2’ and 48362 as game mode ‘22’. It may be worth exploring whether there is any discernible difference between the two modes and whether that difference is enough to include or exclude matches played under game mode ‘2’. The answer may depend upon the question being asked: it is suspected that questions such as ‘which hero is most likely to create a win condition’ may be independent of game mode, in which case no matches should be excluded on this basis. For now this variable will not be used to exclude any matches.
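A first comparison of the two modes might simply contrast match counts, mean duration and Radiant win rate per mode. The base R sketch below does this on a toy data frame; `MatchToy` and its mode assignments are invented for illustration, and a real comparison would use the filtered match data.

```r
# Toy data (hypothetical): six matches across the two game modes
MatchToy <- data.frame(
  game_mode   = c(22, 22, 2, 22, 2, 22),
  duration    = c(2375, 2582, 2716, 3085, 1887, 1574),
  radiant_win = c("True", "False", "False", "False", "True", "True")
)

# Number of matches per game mode
mode_counts <- table(MatchToy$game_mode)

# Mean duration and share of Radiant wins per game mode
mean_duration    <- tapply(MatchToy$duration, MatchToy$game_mode, mean)
radiant_win_rate <- tapply(MatchToy$radiant_win == "True", MatchToy$game_mode, mean)
```

If the per-mode summaries look alike on the full data, that would support keeping both modes in the study.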
The variable ‘radiant_win’ is a logical factor taking either ‘True’ or ‘False’ to indicate whether the Radiant team won or lost the match. This variable is crucial for identifying who won each match; however, a new variable should be created to identify whether each team won or lost. This will allow individual players and heroes to be linked to wins and losses. It will be done after the ‘match’ and ‘players’ data frames are merged, because information from both is required to determine this.
The variables ‘negative_votes’ and ‘positive_votes’ are not considered to be influential on the win condition of a match, other than reflecting each player’s view of how friendly the other players were in the game. These variables will not initially be included in the dataset, but will be kept in mind in case there is reason to return them to the study.
The ‘cluster’ variable refers to where in the world these games were hosted and is not relevant to the win condition for each match. This variable will not be included in the dataset.
For this study the following variables from ‘match’ will be used:
1 match_id, 2 start_time, 3 duration, 4 tower_status_radiant, 5 tower_status_dire, 6 barracks_status_dire, 7 barracks_status_radiant, 8 first_blood_time, 9 game_mode, 10 radiant_win, 14 date
# Collect key variables from 'match' dataset
MatchDF <- MatchNewDur %>% select(1 : 10, 14)
## Observations: 500,000
## Variables: 73
## $ match_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ account_id <int> 0, 1, 0, 2, 3, 4, 0, 5, 0, 6...
## $ hero_id <int> 86, 51, 83, 11, 67, 106, 102...
## $ player_slot <int> 0, 1, 2, 3, 4, 128, 129, 130...
## $ gold <int> 3261, 2954, 110, 1179, 3307,...
## $ gold_spent <int> 10960, 17760, 12195, 22505, ...
## $ gold_per_min <int> 347, 494, 350, 599, 613, 397...
## $ xp_per_min <int> 362, 659, 385, 605, 762, 524...
## $ kills <int> 9, 13, 0, 8, 20, 5, 4, 4, 1,...
## $ deaths <int> 3, 3, 4, 4, 3, 6, 13, 8, 14,...
## $ assists <int> 18, 18, 15, 19, 17, 8, 5, 6,...
## $ denies <int> 1, 9, 1, 6, 13, 5, 2, 31, 0,...
## $ last_hits <int> 30, 109, 58, 271, 245, 162, ...
## $ stuns <fctr> 76.7356, 87.4164, None, Non...
## $ hero_damage <int> 8690, 23747, 4217, 14832, 33...
## $ hero_healing <int> 218, 0, 1595, 2714, 243, 0, ...
## $ tower_damage <int> 143, 423, 399, 6055, 1833, 1...
## $ item_0 <int> 180, 46, 48, 63, 114, 145, 5...
## $ item_1 <int> 37, 63, 60, 147, 92, 73, 11,...
## $ item_2 <int> 73, 119, 59, 154, 147, 149, ...
## $ item_3 <int> 56, 102, 108, 164, 0, 48, 36...
## $ item_4 <int> 108, 24, 65, 79, 137, 212, 1...
## $ item_5 <int> 0, 108, 0, 160, 63, 0, 81, 2...
## $ level <int> 16, 22, 17, 21, 24, 19, 16, ...
## $ leaver_status <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ xp_hero <dbl> 8840, 14331, 6692, 8583, 158...
## $ xp_creep <dbl> 5440, 8440, 8112, 14230, 143...
## $ xp_roshan <dbl> NA, 2683, NA, 894, NA, NA, N...
## $ xp_other <dbl> 83, 671, 453, 293, 62, 1, 1,...
## $ gold_other <dbl> 50, 395, 259, 100, NA, NA, N...
## $ gold_death <dbl> -957, -1137, -1436, -2156, -...
## $ gold_buyback <dbl> NA, NA, -1015, NA, -1056, -2...
## $ gold_abandon <dbl> NA, NA, NA, NA, NA, NA, NA, ...
## $ gold_sell <dbl> 212, 1650, NA, 938, 4194, 20...
## $ gold_destroying_structure <dbl> 3120, 3299, 3142, 4714, 3217...
## $ gold_killing_heros <dbl> 5145, 6676, 2418, 4104, 7467...
## $ gold_killing_creeps <dbl> 1087, 4317, 3697, 10432, 922...
## $ gold_killing_roshan <dbl> 400, 937, 400, 400, 400, NA,...
## $ gold_killing_couriers <dbl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_none <dbl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_move_to_position <dbl> 4070, 5894, 7053, 4712, 3853...
## $ unit_order_move_to_target <dbl> 1, 214, 3, 133, 7, 166, 63, ...
## $ unit_order_attack_move <dbl> 25, 165, 132, 163, 7, 76, 10...
## $ unit_order_attack_target <dbl> 416, 1031, 645, 690, 1173, 8...
## $ unit_order_cast_position <dbl> 51, 98, 36, 9, 31, 196, 13, ...
## $ unit_order_cast_target <dbl> 144, 39, 160, 15, 84, 3, 173...
## $ unit_order_cast_target_tree <dbl> 3, 4, 20, 7, 8, 5, 14, 3, 9,...
## $ unit_order_cast_no_target <dbl> 71, 439, 373, 406, 198, 96, ...
## $ unit_order_cast_toggle <dbl> NA, NA, NA, NA, NA, 2, NA, N...
## $ unit_order_hold_position <dbl> 188, 346, 643, 150, 111, 161...
## $ unit_order_train_ability <dbl> 16, 22, 17, 21, 23, 19, 16, ...
## $ unit_order_drop_item <dbl> NA, NA, 5, NA, 1, NA, NA, 2,...
## $ unit_order_give_item <dbl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_pickup_item <dbl> NA, 12, 7, 1, NA, 2, 1, 18, ...
## $ unit_order_pickup_rune <dbl> 2, 52, 8, 9, 2, NA, 1, 18, 1...
## $ unit_order_purchase_item <dbl> 35, 30, 28, 45, 44, 36, 43, ...
## $ unit_order_sell_item <dbl> 2, 4, NA, 7, 6, 3, 3, 1, NA,...
## $ unit_order_disassemble_item <dbl> NA, NA, 1, NA, NA, NA, NA, N...
## $ unit_order_move_item <dbl> 11, 21, 18, 14, 13, 3, 13, 1...
## $ unit_order_cast_toggle_auto <dbl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_stop <dbl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_taunt <lgl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_buyback <dbl> NA, NA, 1, NA, 1, 2, NA, NA,...
## $ unit_order_glyph <dbl> NA, NA, NA, 1, 3, NA, 4, NA,...
## $ unit_order_eject_item_from_stash <dbl> NA, NA, NA, NA, NA, NA, 1, N...
## $ unit_order_cast_rune <dbl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_ping_ability <dbl> 6, 14, 17, 13, 23, 2, 1, 4, ...
## $ unit_order_move_to_direction <dbl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_patrol <lgl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_vector_target_position <lgl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_radar <lgl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_set_item_combine_lock <lgl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_continue <lgl> NA, NA, NA, NA, NA, NA, NA, ...
| Variables | Descriptions |
|---|---|
| match_id | unique integer to identify matches across all files |
| account_id | unique integer to identify players. 0 (zero) represents anonymous players |
| hero_id | unique integer to identify which hero was selected per player |
| player_slot | unique integer where 0-4 represent players on Radiant side and 128-132 the Dire side |
| gold | amount of gold player has at match end |
| gold_spent | amount of gold player spent during the match |
| gold_per_min | amount of gold per minute a player accumulated on average during the match |
| xp_per_min | amount of experience points per minute a player accumulated on average during the match |
| kills | total kills a player scored per match |
| deaths | number of times a player died per match |
| assists | number of assists a player gained per match |
| denies | number of denies a player performed per match |
| last_hits | number of last hits a player landed on creep per match |
| stuns | cumulative amount of time a player stunned an enemy player |
| hero_damage | cumulative amount of damage a player caused to enemy heroes |
| hero_healing | cumulative amount of healing a player performed upon allied heroes |
| tower_damage | cumulative amount of damage a player caused to enemy towers |
| item_0 to _5 | integers referring to items a player equipped during the match |
| level | the hero character level a player obtained by match end |
| leaver_status | integer indicating if a player stayed until match end or left early |
| xp_hero | experience points gained from killing heroes |
| xp_creep | experience points gained from killing creep |
| xp_roshan | experience points gained from killing Roshan, neutral creep boss in jungle |
| xp_other | experience points gained from other sources |
| gold_other | gold gained from other non-conventional sources |
| gold_death | cumulative amount of gold lost due to death |
| gold_buyback | gold spent on buybacks (paying to respawn immediately after dying) |
| gold_abandon | gold abandoned by a player |
| gold_sell | gold gained from selling items |
| gold_destroying_structure | gold gained from destroying structures e.g. tower, barracks, ancient |
| gold_killing_heros | gold gained from killing enemy heroes |
| gold_killing_creeps | gold gained from killing creeps |
| gold_killing_roshan | gold gained from killing Roshan, high level neutral creep boss in jungle |
| gold_killing_couriers | gold gained from killing enemy couriers |
The variables starting with ‘unit_order_’ (columns 40 - 73 in the ‘players’ data frame) represent the number of physical mouse and keyboard clicks a player made to perform certain actions. The usefulness of these variables is not immediately obvious; however, the total of all clicks per player per match will be collected, since it could be used as a rough indicator of how advanced a player is or how skilful a player needs to be to operate a hero effectively.
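The accumulation described above can be sketched in base R as a row-wise sum over every column whose name starts with ‘unit_order_’. The toy data frame `PlayersToy` is hypothetical, and treating `NA` as zero (an action that was never recorded) is an assumption rather than something the data source confirms.

```r
# Toy stand-in for the 'players' data with three of the unit_order_ columns
PlayersToy <- data.frame(
  match_id                    = c(0, 0, 0),
  unit_order_move_to_position = c(4070, 5894, 7053),
  unit_order_attack_target    = c(416, 1031, 645),
  unit_order_drop_item        = c(NA, NA, 5)
)

# Select the click-count columns by name rather than by position
order_cols <- grep("^unit_order_", names(PlayersToy), value = TRUE)

# Sum across the selected columns, counting NA as zero (assumption)
PlayersToy$unit_order_total <- rowSums(PlayersToy[order_cols], na.rm = TRUE)
```

Selecting columns by name pattern avoids depending on the column positions 40 - 73 staying fixed.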
The histogram of the variable ‘unit_order_total’ is mainly normally distributed, with a small right-skewed tail extending towards 250000. It may be more informative when examined per unit of time, which will require the duration variable from MatchDF.
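Examining the total over time might mean expressing it as orders per minute. The base R sketch below joins a toy players table to a toy match table on ‘match_id’ and divides by the duration in minutes. The column name `orders_per_min`, the data frame names and the third player's total are assumptions for illustration.

```r
# Toy match and player tables (hypothetical names; third total is invented)
MatchToy   <- data.frame(match_id = c(0, 1), duration = c(2375, 2582))
PlayersToy <- data.frame(match_id = c(0, 0, 1),
                         unit_order_total = c(5041, 8385, 6000))

# Attach each match's duration to its players
Joined <- merge(PlayersToy, MatchToy, by = "match_id")

# Orders per minute: total orders divided by duration in minutes
Joined$orders_per_min <- Joined$unit_order_total / (Joined$duration / 60)
```

A per-minute rate makes players from short and long matches comparable, which the raw total does not.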
Let’s look at the histograms of some of the other key variables.
The ‘players’ data frame contains data from the same 50,000 matches recorded in ‘match’ for each player in each match. Each match has ten players, five per side which is why this data frame has 500,000 observations. As mentioned before the variable ‘match_id’ will be used to link the two main data frames together, ‘match’ and ‘players’.
The variable ‘hero_id’ is an integer that, when linked to ‘hero_names.csv’, identifies which hero each player selected prior to each match. This will be key in determining which heroes are best and which should be avoided when first starting out in the game. It is noted that ‘hero_id’ ranges from `r min(players$hero_id)` to `r max(players$hero_id)`; however, hero_id 0 is invalid and represents a hero that was not selected. This means that matches containing hero_id 0 were unbalanced from the start and should be removed from the data set.
# identify which matches have hero_id 0 present
Tally0 <- PlayersShort %>%
  group_by(match_id, hero_id) %>%
  tally() %>%
  filter(hero_id == 0)
glimpse(Tally0)
## Observations: 35
## Variables: 3
## $ match_id <int> 720, 1032, 1108, 2134, 2773, 7098, 7488, 7582, 7831, ...
## $ hero_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ n <int> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
‘Tally0’ shows us there are 35 matches that contain at least one hero_id 0. These represent unbalanced matches, since at least one side has fewer than five active players. They shall be removed from the ‘MatchDF’ data set.
# identify all match_ids that have hero_id 0 in them and remove from the data set MatchDF
HeroIDrm <- c(Tally0$match_id)
MatchDF <- MatchDF %>% filter(!(match_id %in% HeroIDrm))
Applying this to MatchDF removed only one match, since the duration limits previously applied had already filtered out most of these matches. This is as expected: when teams are unbalanced, the match duration is likely to fall outside the range expected of a competitive match. There are now 49677 matches in this data set.
Let’s take a more detailed look at the cleaned hero_id variable’s distribution.
This histogram shows considerable variation in how often each hero is selected, which reflects each hero’s popularity within the matches studied here. This will be important when investigating the different win conditions.
The variables ‘gold_per_min’, ‘all_gold_gains’, ‘all_gold_loss’ (by death or purchases), ‘xp_per_min’ (experience points per minute) and ‘all_xp_gains’ are general statistics, not necessarily tied to PvP combat, that are believed to influence players’ performances and contribute to win conditions. Their distributions are approximately normal, and all gold and xp variables will be cross-examined to see whether they support any models that might help determine how well a team performed towards achieving win conditions.
The variables ‘kills’ and ‘deaths’ are player-vs-player (PvP) combat statistics collected per player: the number of enemy players killed per match and the number of times the player died at enemy hands per match. The number of assists per match is the number of enemy player deaths each player significantly contributed to without landing the final hit. These variables will help to indicate how well each player performed relative to all the other players in the match. Further investigations could then link each player by their player_id assignment previously mentioned. These variables show approximately normal distributions and there is no reason to exclude any data based upon them.
The variable ‘denies’ counts how many allied creep below 50% health a player has killed to stop enemy players scoring last hits and gaining xp and gold. It is another variable that could be studied in combination with others to see whether it has any influence on the win condition. It shows a right-skewed distribution biased towards low values, which is to be expected since denying enemy players their creep kills is rare.
‘last_hits’ is the number of last hits a player scored on creep. Last hits are important to track because a player must land the last hit on a creep to gain xp and gold from the kill, which is also why denies work: they stop enemies from landing that hit. The distribution here is closer to normal than the denies distribution, with a slight right skew, as last hits are generally more sought after by players.
The hero damage variable is approximately normally distributed and represents the cumulative amount of damage a player has caused to enemy players. This will be useful for determining hero effectiveness against enemy players.
The hero healing variable represents the amount of healing a player has performed upon allied heroes. The distribution is a very narrow right-skew, mostly due to the fact that not many heroes have the ability to heal. This variable will help to indicate the effectiveness of healing support role heroes where their other variable counts, such as kills, might be lacking.
The tower damage variable indicates how much damage a player caused to enemy towers during the match. The distribution here is right-skewed as expected. There is limited enemy tower health to destroy and it is unlikely that a single hero from one team would be responsible for all the tower damage in a single match.
The level variable indicates what level the hero reached at match end. The distribution is fairly flat above 10. This variable will be helpful when comparing hero levels from winning and losing teams.
The variable ‘account_id’ is a number Devin assigned to each new unique player recorded in the dataset. Devin applied his own numbering to each account to save space, since the original ‘account_id’ values were much larger. 0 represents a player who is playing anonymously, while the values from 1 to 158360 represent unique players. No data will be removed based upon this variable.
A look at how often individual players are recorded in the dataset reveals something interesting.
# tally up the number of times a player is recorded in the dataset
PlayerTally <- PlayersShort %>%
  group_by(account_id) %>%
  tally(sort = TRUE) %>%
  filter(!(account_id == 0))
glimpse(PlayerTally)
## Observations: 158,360
## Variables: 2
## $ account_id <int> 2701, 2362, 10307, 18680, 4390, 2962, 2487, 6551, 1...
## $ n <int> 71, 60, 59, 59, 58, 57, 54, 54, 54, 54, 53, 51, 51,...
Here we see that the most recorded player (account_id 2701) appears in only 71 matches. With so few matches per player in a dataset of nearly 50,000 matches, it may not be possible to draw significant conclusions regarding player behaviour.
The leaver status variable ranges from 0 to 4, where 0 indicates a player who stayed until match end and 1-4 indicate varying degrees of leaving during the match. The important point here is to only include matches where no one left. As indicated by the histogram, most of the matches did not have any leavers.
# identify the matches where at least one player left before the end of the match
TallyLS <- PlayersShort %>%
  group_by(match_id, leaver_status) %>%
  tally() %>%
  filter(leaver_status != 0)
glimpse(TallyLS)
## Observations: 9,593
## Variables: 3
## $ match_id <int> 7, 19, 22, 32, 32, 35, 36, 36, 49, 62, 73, 82, 8...
## $ leaver_status <int> 1, 1, 1, 1, 2, 1, 1, 3, 2, 1, 4, 1, 1, 2, 4, 1, ...
## $ n <int> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, ...
This shows there are 9593 combinations of match and leaver status in which at least one player is recorded leaving the match at some point; a match appears more than once when its leavers have different statuses, and ‘n’ gives the number of players in each combination. These matches should be removed from the dataset because even a single leaver represents a 20% loss of strength for the corresponding side. This gives the team with all players present a significant advantage over the other and will skew all the variables recorded for both teams. It may be interesting at a later date to compare the matches with leavers to those without, to see how much this affects match outcomes.
# identify all match_ids that have leaver_status > 0 and remove from the data set MatchDF
LSrm <- c(TallyLS$match_id)
MatchDF <- MatchDF %>% filter(!(match_id %in% LSrm))
Applying this filter to MatchDF reduces the data set to 42051 viable matches, which is 84% of the original 50000 matches recorded.
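Because ‘TallyLS’ is grouped by both ‘match_id’ and ‘leaver_status’, its row count, the number of distinct matches with leavers and the number of leaving players are three different quantities. The base R sketch below separates them using the first eight rows shown in the summary above; the variable names on the left-hand side are assumptions chosen for illustration.

```r
# First eight rows of TallyLS as printed in the summary above
TallyLS <- data.frame(
  match_id      = c(7, 19, 22, 32, 32, 35, 36, 36),
  leaver_status = c(1, 1, 1, 1, 2, 1, 1, 3),
  n             = c(1, 1, 1, 1, 1, 1, 2, 1)
)

rows_in_tally        <- nrow(TallyLS)                     # match/status combinations
matches_with_leavers <- length(unique(TallyLS$match_id))  # distinct matches affected
players_leaving      <- sum(TallyLS$n)                    # individual players who left

# Share of the original 50000 matches retained after all filters
retained_pct <- 100 * 42051 / 50000
```

On these eight rows the three counts already differ (8 combinations, 6 matches, 9 players), which is why the number of matches removed cannot be read directly from the row count.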
The variable ‘player_slot’ contains integers ranging from 0-4 and 128-132 for each match. 0-4 represents the 5 players playing for the Radiant side and 128-132 for those 5 players playing for the Dire side. A new variable will be generated to label which side each player is playing for where 0-4 = Radiant or “R” and 128-132 = Dire or “D”.
# add variable "team" whereby 'player_slot' values 0-4 represent Radiant team ("R") and 128-132 represent Dire team ("D").
PlayersTeam <- PlayersShort
PlayersTeam$team <- ifelse(PlayersTeam$player_slot < 5, "R", "D")
For this study the following 27 variables from the original file ‘players’ will be used:
1 match_id, 2 account_id, 3 hero_id, 4 player_slot, 5 gold, 6 gold_spent, 7 gold_per_min, 8 xp_per_min, 9 kills, 10 deaths, 11 assists, 12 denies, 13 last_hits, 15 hero_damage, 16 hero_healing, 17 tower_damage, 24 level, 26 xp_hero, 27 xp_creep, 28 xp_roshan, 31 gold_death, 35 gold_destroying_structure, 36 gold_killing_heros, 37 gold_killing_creeps, 38 gold_killing_roshan, 40 unit_order_total, 41 team
# select key variables from player file
PlayersDF <- PlayersTeam %>% select(1 : 13, 15 : 17, 24, 26 : 28, 31, 35 : 38, 40, 41)
## Observations: 112
## Variables: 3
## $ name <fctr> npc_dota_hero_antimage, npc_dota_hero_axe, npc...
## $ hero_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ localized_name <fctr> Anti-Mage, Axe, Bane, Bloodseeker, Crystal Mai...
The data frame ‘Hero_Names’ holds the key for the ‘hero_id’ variable in the ‘players’ data frame. This data frame will be tidied up and prepared for applying the key to the ‘hero_id’ variable. The first column ‘name’ holds duplicate information and will be removed. The remaining two columns will be named accordingly: “hero_id” and “Hero_Names”.
Hero_Names["name"] <- NULL
colnames(Hero_Names) <- c("hero_id", "Hero_Names")
Further examination of the ‘hero_names’ data frame shows that there is no hero_id 24 in the data set and the most recent hero ‘Monkey King’ is not present since this data set is taken from before Monkey King was introduced to the game.
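A gap such as the missing hero_id 24 can be found by comparing the ids present against the full run of integers between the smallest and largest id. The base R sketch below uses a shortened, invented id vector purely to show the check; `hero_ids` and `missing_ids` are hypothetical names.

```r
# Toy vector of hero ids with 24 absent (the real file has 112 ids)
hero_ids <- c(1:23, 25:30)

# Ids that would be expected in an unbroken sequence but are not present
missing_ids <- setdiff(seq(min(hero_ids), max(hero_ids)), hero_ids)
```

Running the same comparison on the ‘hero_id’ column of ‘Hero_Names’ would list every gap in one step.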
The hero_names.csv file was updated to include each hero’s roles and class as defined by dota2.gamepedia.com/Role. Each hero can have multiple roles but only one of three classes. The three classes and their descriptions are:
| Hero_Class | Class_Descriptions |
|---|---|
| Strength (STR) | warrior-like class mainly dealing melee damage |
| Agility (AGI) | agile class, fast and harder to hit |
| Intellect (INT) | spell-wielding class |
The hero roles and their descriptions are:
| Roles | Gamepedia_Descriptions |
|---|---|
| Carry | Will become more useful later in the game if they gain a significant gold advantage |
| Disabler | Has a guaranteed disable for one or more of their spells |
| Initiator | Good at starting a teamfight |
| Jungler | Can farm effectively from neutral creeps inside the jungle early in the game |
| Support | Can focus less on amassing gold and items, and more on using their abilities to gain an advantage for the team |
| Durable | Has the ability to last longer in teamfights |
| Nuker | Can quickly kill enemy heroes using high damage spells with low cooldowns |
| Pusher | Can quickly siege and destroy towers and barracks at all points of the game |
| Escape | Has the ability to quickly avoid death |
For more details on roles please see dota2.gamepedia.com/Role. The data frame Hero_Names was exported as a CSV file and modified in Excel to include the class and role information. It now looks like this.
## Observations: 112
## Variables: 14
## $ hero_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ Hero_Names <fctr> Anti-Mage, Axe, Bane, Bloodseeker, Crystal Maid...
## $ All_Roles <chr> "Car Nuk Esc", " Dis Ini Jun Dur ", " D...
## $ Carry_Car <fctr> Car, NA, NA, Car, NA, Car, NA, Car, Car, Car, C...
## $ Disabler_Dis <fctr> NA, Dis, Dis, Dis, Dis, Dis, Dis, NA, Dis, Dis,...
## $ Initiator_Ini <fctr> NA, Ini, NA, Ini, NA, NA, Ini, NA, NA, NA, NA, ...
## $ Jungler_Jun <fctr> NA, Jun, NA, Jun, Jun, NA, NA, NA, NA, NA, NA, ...
## $ Support_Sup <fctr> NA, NA, Sup, NA, Sup, NA, Sup, NA, Sup, NA, NA,...
## $ Durable_Dur <fctr> NA, Dur, Dur, NA, NA, NA, NA, NA, NA, Dur, NA, ...
## $ Nuker_Nuk <fctr> Nuk, NA, Nuk, Nuk, Nuk, NA, Nuk, NA, Nuk, Nuk, ...
## $ Pusher_Pus <fctr> NA, NA, NA, NA, NA, Pus, NA, Pus, NA, NA, NA, P...
## $ Escape_Esc <fctr> Esc, NA, NA, NA, NA, NA, NA, Esc, Esc, Esc, NA,...
## $ Role_Count <int> 3, 4, 4, 5, 4, 3, 4, 3, 5, 5, 2, 4, 4, 4, 4, 5, ...
## $ Class <fctr> AGI, STR, INT, AGI, INT, AGI, STR, AGI, AGI, AG...
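Although the role columns were added by hand in Excel, a column such as ‘Role_Count’ could also be derived or cross-checked in R by counting the non-missing role flags per hero. The base R sketch below does this for the first two heroes, using the role values shown in the summary above; `HeroToy` is a hypothetical name and only the role columns these two heroes use are included.

```r
# Role flags for the first two heroes, as shown in the summary above
HeroToy <- data.frame(
  Hero_Names    = c("Anti-Mage", "Axe"),
  Carry_Car     = c("Car", NA),
  Disabler_Dis  = c(NA, "Dis"),
  Initiator_Ini = c(NA, "Ini"),
  Jungler_Jun   = c(NA, "Jun"),
  Durable_Dur   = c(NA, "Dur"),
  Nuker_Nuk     = c("Nuk", NA),
  Escape_Esc    = c("Esc", NA),
  stringsAsFactors = FALSE
)

# Every column except the hero name is a role flag
role_cols <- setdiff(names(HeroToy), "Hero_Names")

# Count the roles each hero has by counting non-missing flags
HeroToy$Role_Count <- rowSums(!is.na(HeroToy[role_cols]))
```

The counts of 3 and 4 agree with the first two ‘Role_Count’ values in the summary, so the same check could flag any manual entry errors in the full file.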
Each hero is now linked to their unique class and collection of roles as defined by the community at dota2.gamepedia.com/Role. These will be bound to the PlayersDF data set.
# combine PlayersDF and Hero_Names by the variable 'hero_id'
PlayersDF <- left_join(PlayersDF, Hero_Names, by = "hero_id")
glimpse(PlayersDF)
## Observations: 500,000
## Variables: 40
## $ match_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ account_id <int> 0, 1, 0, 2, 3, 4, 0, 5, 0, 6, 0, 7, ...
## $ hero_id <int> 86, 51, 83, 11, 67, 106, 102, 46, 7,...
## $ player_slot <int> 0, 1, 2, 3, 4, 128, 129, 130, 131, 1...
## $ gold <int> 3261, 2954, 110, 1179, 3307, 476, 31...
## $ gold_spent <int> 10960, 17760, 12195, 22505, 23825, 1...
## $ gold_per_min <int> 347, 494, 350, 599, 613, 397, 303, 4...
## $ xp_per_min <int> 362, 659, 385, 605, 762, 524, 369, 5...
## $ kills <int> 9, 13, 0, 8, 20, 5, 4, 4, 1, 1, 3, 9...
## $ deaths <int> 3, 3, 4, 4, 3, 6, 13, 8, 14, 11, 4, ...
## $ assists <int> 18, 18, 15, 19, 17, 8, 5, 6, 8, 6, 9...
## $ denies <int> 1, 9, 1, 6, 13, 5, 2, 31, 0, 0, 0, 9...
## $ last_hits <int> 30, 109, 58, 271, 245, 162, 107, 208...
## $ hero_damage <int> 8690, 23747, 4217, 14832, 33740, 107...
## $ hero_healing <int> 218, 0, 1595, 2714, 243, 0, 764, 0, ...
## $ tower_damage <int> 143, 423, 399, 6055, 1833, 112, 0, 2...
## $ level <int> 16, 22, 17, 21, 24, 19, 16, 19, 12, ...
## $ xp_hero <dbl> 8840, 14331, 6692, 8583, 15814, 8502...
## $ xp_creep <dbl> 5440, 8440, 8112, 14230, 14325, 1225...
## $ xp_roshan <dbl> NA, 2683, NA, 894, NA, NA, NA, NA, N...
## $ gold_death <dbl> -957, -1137, -1436, -2156, -1437, -2...
## $ gold_destroying_structure <dbl> 3120, 3299, 3142, 4714, 3217, 320, 3...
## $ gold_killing_heros <dbl> 5145, 6676, 2418, 4104, 7467, 5281, ...
## $ gold_killing_creeps <dbl> 1087, 4317, 3697, 10432, 9220, 6193,...
## $ gold_killing_roshan <dbl> 400, 937, 400, 400, 400, NA, NA, NA,...
## $ unit_order_total <dbl> 5041, 8385, 9167, 6396, 5588, 8197, ...
## $ team <chr> "R", "R", "R", "R", "R", "D", "D", "...
## $ Hero_Names <fctr> Rubick, Clockwerk, Treant Protector...
## $ All_Roles <chr> " Dis Sup Nuk ", " Dis Ini Dur...
## $ Carry_Car <fctr> NA, NA, NA, Car, Car, Car, Car, Car...
## $ Disabler_Dis <fctr> Dis, Dis, Dis, NA, NA, Dis, NA, NA,...
## $ Initiator_Ini <fctr> NA, Ini, Ini, NA, NA, Ini, NA, NA, ...
## $ Jungler_Jun <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ Support_Sup <fctr> Sup, NA, Sup, NA, NA, NA, Sup, NA, ...
## $ Durable_Dur <fctr> NA, Dur, Dur, NA, Dur, Dur, Dur, NA...
## $ Nuker_Nuk <fctr> Nuk, Nuk, NA, Nuk, NA, Nuk, NA, NA,...
## $ Pusher_Pus <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ Escape_Esc <fctr> NA, NA, Esc, NA, Esc, Esc, NA, Esc,...
## $ Role_Count <int> 3, 4, 5, 2, 3, 6, 3, 2, 4, 6, 4, 6, ...
## $ Class <fctr> INT, STR, STR, AGI, AGI, AGI, STR, ...
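A left join keeps every player row, so any 'hero_id' that is missing from the lookup table ends up with NA in the hero columns rather than being dropped. The following is a minimal sketch of how that could be checked. It uses base R (`merge` with `all.x = TRUE` as a stand-in for `left_join`) and small made-up data frames, `players_toy` and `heroes_toy`, which are hypothetical and not part of the project data.

```r
# Toy stand-ins for PlayersDF and Hero_Names (values are made up for illustration)
players_toy <- data.frame(match_id = c(0, 0, 1), hero_id = c(86, 51, 0))
heroes_toy  <- data.frame(hero_id = c(86, 51),
                          Hero_Names = c("Rubick", "Clockwerk"),
                          stringsAsFactors = FALSE)

# hero_id values present in the player rows but absent from the lookup table
unmatched_ids <- setdiff(players_toy$hero_id, heroes_toy$hero_id)

# base R equivalent of a left join: all.x = TRUE keeps every player row
joined_toy <- merge(players_toy, heroes_toy, by = "hero_id", all.x = TRUE)

# number of player rows left without a hero name after the join
n_missing <- sum(is.na(joined_toy$Hero_Names))
```

Running the same two checks against the real data frames would show whether any player rows lack hero information before they are used in role or class analysis.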
Both the ‘MatchDF’ and ‘PlayersDF’ data frames have now been cleaned and contain only data pertinent to the project. They will be joined on their common ‘match_id’ variable. Each player in each match will also receive a new variable recording their win/loss status, derived from the ‘radiant_win’ and ‘team’ variables.
# Combine PlayersDF and MatchDF into CombinedDF
CombinedDF <- left_join(MatchDF, PlayersDF, by = "match_id")
# add column 'WL' to CombinedDF to indicate Win/Loss for each individual player
CombinedDF$WL <- ifelse(CombinedDF$radiant_win == "True" & CombinedDF$team == "R", "Win",
                 ifelse(CombinedDF$radiant_win == "True" & CombinedDF$team == "D", "Loss",
                 ifelse(CombinedDF$radiant_win == "False" & CombinedDF$team == "R", "Loss", "Win")))
# add column 'Win' to CombinedDF to indicate win/loss by 1 and 0
CombinedDF$Win <- ifelse(CombinedDF$WL == "Win", 1L, 0L)
# rename column 'Hero_Names' to 'Name'
CombinedDF <- CombinedDF %>% rename(Name = Hero_Names)
glimpse(CombinedDF)
## Observations: 420,510
## Variables: 52
## $ match_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ start_time <int> 1446750112, 1446750112, 1446750112, ...
## $ duration <int> 2375, 2375, 2375, 2375, 2375, 2375, ...
## $ tower_status_radiant <int> 1982, 1982, 1982, 1982, 1982, 1982, ...
## $ tower_status_dire <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1846, ...
## $ barracks_status_dire <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 63, 63...
## $ barracks_status_radiant <int> 63, 63, 63, 63, 63, 63, 63, 63, 63, ...
## $ first_blood_time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 221, 2...
## $ game_mode <int> 22, 22, 22, 22, 22, 22, 22, 22, 22, ...
## $ radiant_win <fctr> True, True, True, True, True, True,...
## $ date <dttm> 2015-11-05 19:01:52, 2015-11-05 19:...
## $ account_id <int> 0, 1, 0, 2, 3, 4, 0, 5, 0, 6, 0, 7, ...
## $ hero_id <int> 86, 51, 83, 11, 67, 106, 102, 46, 7,...
## $ player_slot <int> 0, 1, 2, 3, 4, 128, 129, 130, 131, 1...
## $ gold <int> 3261, 2954, 110, 1179, 3307, 476, 31...
## $ gold_spent <int> 10960, 17760, 12195, 22505, 23825, 1...
## $ gold_per_min <int> 347, 494, 350, 599, 613, 397, 303, 4...
## $ xp_per_min <int> 362, 659, 385, 605, 762, 524, 369, 5...
## $ kills <int> 9, 13, 0, 8, 20, 5, 4, 4, 1, 1, 3, 9...
## $ deaths <int> 3, 3, 4, 4, 3, 6, 13, 8, 14, 11, 4, ...
## $ assists <int> 18, 18, 15, 19, 17, 8, 5, 6, 8, 6, 9...
## $ denies <int> 1, 9, 1, 6, 13, 5, 2, 31, 0, 0, 0, 9...
## $ last_hits <int> 30, 109, 58, 271, 245, 162, 107, 208...
## $ hero_damage <int> 8690, 23747, 4217, 14832, 33740, 107...
## $ hero_healing <int> 218, 0, 1595, 2714, 243, 0, 764, 0, ...
## $ tower_damage <int> 143, 423, 399, 6055, 1833, 112, 0, 2...
## $ level <int> 16, 22, 17, 21, 24, 19, 16, 19, 12, ...
## $ xp_hero <dbl> 8840, 14331, 6692, 8583, 15814, 8502...
## $ xp_creep <dbl> 5440, 8440, 8112, 14230, 14325, 1225...
## $ xp_roshan <dbl> NA, 2683, NA, 894, NA, NA, NA, NA, N...
## $ gold_death <dbl> -957, -1137, -1436, -2156, -1437, -2...
## $ gold_destroying_structure <dbl> 3120, 3299, 3142, 4714, 3217, 320, 3...
## $ gold_killing_heros <dbl> 5145, 6676, 2418, 4104, 7467, 5281, ...
## $ gold_killing_creeps <dbl> 1087, 4317, 3697, 10432, 9220, 6193,...
## $ gold_killing_roshan <dbl> 400, 937, 400, 400, 400, NA, NA, NA,...
## $ unit_order_total <dbl> 5041, 8385, 9167, 6396, 5588, 8197, ...
## $ team <chr> "R", "R", "R", "R", "R", "D", "D", "...
## $ Name <fctr> Rubick, Clockwerk, Treant Protector...
## $ All_Roles <chr> " Dis Sup Nuk ", " Dis Ini Dur...
## $ Carry_Car <fctr> NA, NA, NA, Car, Car, Car, Car, Car...
## $ Disabler_Dis <fctr> Dis, Dis, Dis, NA, NA, Dis, NA, NA,...
## $ Initiator_Ini <fctr> NA, Ini, Ini, NA, NA, Ini, NA, NA, ...
## $ Jungler_Jun <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ Support_Sup <fctr> Sup, NA, Sup, NA, NA, NA, Sup, NA, ...
## $ Durable_Dur <fctr> NA, Dur, Dur, NA, Dur, Dur, Dur, NA...
## $ Nuker_Nuk <fctr> Nuk, Nuk, NA, Nuk, NA, Nuk, NA, NA,...
## $ Pusher_Pus <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ Escape_Esc <fctr> NA, NA, Esc, NA, Esc, Esc, NA, Esc,...
## $ Role_Count <int> 3, 4, 5, 2, 3, 6, 3, 2, 4, 6, 4, 6, ...
## $ Class <fctr> INT, STR, STR, AGI, AGI, AGI, STR, ...
## $ WL <chr> "Win", "Win", "Win", "Win", "Win", "...
## $ Win <int> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
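The win/loss rule reduces to one comparison: a player wins when their side is the side that won. The sketch below shows that compact form together with a sanity check that each complete match contains exactly five winners. It uses base R and a made-up data frame, `combined_toy`, which is hypothetical and only mimics the relevant columns of 'CombinedDF'.

```r
# Toy stand-in for CombinedDF: two matches, ten players each (made-up values)
combined_toy <- data.frame(
  match_id    = rep(c(0, 1), each = 10),
  radiant_win = factor(rep(c("True", "False"), each = 10)),
  team        = rep(rep(c("R", "D"), each = 5), times = 2),
  stringsAsFactors = FALSE
)

# A player wins when their side matches the winning side
radiant_won <- combined_toy$radiant_win == "True"
on_radiant  <- combined_toy$team == "R"
combined_toy$WL  <- ifelse(radiant_won == on_radiant, "Win", "Loss")
combined_toy$Win <- as.integer(combined_toy$WL == "Win")

# Each complete match should contain exactly five winners
wins_per_match <- tapply(combined_toy$Win, combined_toy$match_id, sum)
```

Applied to the real data, any match whose count differs from five would point to missing player rows or an unexpected 'team' value.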
The ‘CombinedDF’ data frame has 420,510 observations across 52 variables and holds all the data discussed in this report. It will be the source for all subsequent analyses in this project.
# create 'CombinedDF.csv' file for further use in future Rmd documents.
write.csv(CombinedDF, "CombinedDF.csv")
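One detail to be aware of when the file is read back in later documents: by default `write.csv` also writes the row names as an unnamed first column, which `read.csv` then imports as a column called 'X'. The sketch below illustrates the difference using temporary files and a made-up data frame, `toy`, which is hypothetical. Whether to pass `row.names = FALSE` in the project itself is a choice left to the author.

```r
# Toy data frame standing in for CombinedDF (made-up values)
toy <- data.frame(match_id = c(0, 1), Win = c(1L, 0L))

# Temporary files so the sketch does not touch the project directory
path_default <- tempfile(fileext = ".csv")
path_clean   <- tempfile(fileext = ".csv")

# Default behaviour: row names are written as an extra first column
write.csv(toy, path_default)
# With row.names = FALSE only the data columns are written
write.csv(toy, path_clean, row.names = FALSE)

# Column names seen when each file is read back
cols_default <- names(read.csv(path_default))
cols_clean   <- names(read.csv(path_clean))
```

With the default call, downstream code would see one column more than the 52 variables listed above.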