The Dataset & Initial Exploratory Data Analysis

This study uses a Dota 2 match dataset posted to Kaggle by user Devin (www.kaggle.com/devinanzelmo/dota-2-matches), which spreads a wide range of variables across several CSV files. Devin obtained the data from www.opendota.com, a data-mining website that collects match data from the Dota 2 game servers. This study focuses mainly on two of these files:

match.csv - contains 50,000 observations over 13 variables

players.csv - contains 500,000 observations over 73 variables

Other useful files that will support this work include:

hero_names.csv - originally contained 112 observations over 3 variables. This file will have more variables added based upon defined roles and classes from dota2.gamepedia.com/Role.

match.csv

## Observations: 50,000
## Variables: 13
## $ match_id                <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
## $ start_time              <int> 1446750112, 1446753078, 1446764586, 14...
## $ duration                <int> 2375, 2582, 2716, 3085, 1887, 1574, 21...
## $ tower_status_radiant    <int> 1982, 0, 256, 4, 2047, 2047, 1972, 204...
## $ tower_status_dire       <int> 4, 1846, 1972, 1924, 0, 4, 0, 0, 1982,...
## $ barracks_status_dire    <int> 3, 63, 63, 51, 0, 3, 3, 0, 63, 63, 51,...
## $ barracks_status_radiant <int> 63, 0, 48, 3, 63, 63, 63, 63, 0, 0, 63...
## $ first_blood_time        <int> 1, 221, 190, 40, 58, 113, 4, 255, 4, 8...
## $ game_mode               <int> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22...
## $ radiant_win             <fctr> True, False, False, False, True, True...
## $ negative_votes          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ positive_votes          <int> 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ cluster                 <int> 155, 154, 132, 191, 156, 155, 151, 138...

Quick Overview of match.csv

Variable                 Description
match_id                 unique integer to identify matches across all files
start_time               match start time in seconds since 00:00, 01/01/1970, UTC
duration                 length of match in seconds
tower_status_radiant     total health of all Radiant towers at match end
tower_status_dire        total health of all Dire towers at match end
barracks_status_dire     total health of all Dire barracks at match end
barracks_status_radiant  total health of all Radiant barracks at match end
first_blood_time         time in seconds from match start until the first player is killed
game_mode                integer indicating which game mode the match was played in
radiant_win              factor indicating if the Radiant team won or lost the match
negative_votes           number of negative votes a match received
positive_votes           number of positive votes a match received
cluster                  integer indicating the region of the world in which the match was hosted
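
A side note on the status columns: the largest values in the output above, 2047 and 63, equal 2^11 - 1 and 2^6 - 1. One possible reading, which is an assumption here and not something this study establishes, is that each field packs one bit per building rather than a health total. Under that assumption, the number of standing buildings could be counted as in this base R sketch (the helper `count_set_bits` is hypothetical):

```r
# Count the bits set in each status value.
# Assumption: one bit per building, as a bit-mask reading would imply.
count_set_bits <- function(x) {
  vapply(x, function(v) sum(as.integer(intToBits(as.integer(v)))), integer(1))
}

tower_status    <- c(1982, 0, 256, 4, 2047)  # first values from the output above
barracks_status <- c(63, 0, 48, 3, 51)

towers_standing   <- count_set_bits(tower_status)
barracks_standing <- count_set_bits(barracks_status)
print(towers_standing)
print(barracks_standing)
```

If this reading holds, differences between the raw integers would not be proportional to differences in building counts, which is worth keeping in mind when interpreting them.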

Let’s look at histograms of some of these variables, which show some interesting features. The variables ‘match_id’, ‘game_mode’, ‘radiant_win’, ‘negative_votes’, ‘positive_votes’ and ‘cluster’ will not be examined by histogram.

start_time

‘start_time’ records when each match started, in seconds since 00:00, 01/01/1970, UTC. The bulk of the recorded matches fall in the second half of the ‘start_time’ range. When asked why this might be, Devin responded that he believes it is due to the way Opendota.com sampled the Dota 2 game servers (please see www.kaggle.com/louisms/discussion): the change in frequency marks the moment when Opendota.com improved its sampling to capture more games. This discrepancy does not affect how each recorded game was played, so no matches will be excluded on this basis.
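
Because ‘start_time’ is stored as epoch seconds, it is easier to read once converted to a date-time. The sketch below uses base R on the first few values from the output above. The exact call used in this study is not shown, so treat the conversion as an assumption.

```r
# Convert Unix epoch seconds (as stored in 'start_time') to UTC date-times.
start_time <- c(1446750112, 1446753078, 1446764586)  # first values from the output above

date <- as.POSIXct(start_time, origin = "1970-01-01", tz = "UTC")
formatted <- format(date, "%Y-%m-%d %H:%M:%S")
print(formatted)
```

The first value converts to 2015-11-05 19:01:52 UTC.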

duration

‘duration’ appears roughly normally distributed between about 1000 and 5000 seconds. However, some data lie above 16000 seconds; these are most likely outliers representing games that were not played at a competitive level. The shortest game is 59 seconds long, which is clearly not a competitive game either. Let’s take a closer look at the distribution.

It is important for this study to keep the matches that were most likely played competitively, and duration is a key indicator of this. Dota 2 forum discussions of average game length generally agree that matches last 35-45 minutes on average (2100 - 2700 seconds), with matches of 25 minutes (1500 seconds) considered ‘short’ and matches of 60 minutes (3600 seconds) considered ‘long’. For this study we will only look at matches that last between 15 - 75 minutes (900 - 4500 seconds).

## Observations: 49,678
## Variables: 14
## $ match_id                <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
## $ start_time              <int> 1446750112, 1446753078, 1446764586, 14...
## $ duration                <int> 2375, 2582, 2716, 3085, 1887, 1574, 21...
## $ tower_status_radiant    <int> 1982, 0, 256, 4, 2047, 2047, 1972, 204...
## $ tower_status_dire       <int> 4, 1846, 1972, 1924, 0, 4, 0, 0, 1982,...
## $ barracks_status_dire    <int> 3, 63, 63, 51, 0, 3, 3, 0, 63, 63, 51,...
## $ barracks_status_radiant <int> 63, 0, 48, 3, 63, 63, 63, 63, 0, 0, 63...
## $ first_blood_time        <int> 1, 221, 190, 40, 58, 113, 4, 255, 4, 8...
## $ game_mode               <int> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22...
## $ radiant_win             <fctr> True, False, False, False, True, True...
## $ negative_votes          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ positive_votes          <int> 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ cluster                 <int> 155, 154, 132, 191, 156, 155, 151, 138...
## $ date                    <dttm> 2015-11-05 19:01:52, 2015-11-05 19:51...

This has removed 322 matches leaving 49678 matches in the study with the following distribution of ‘duration’.
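
The duration filter described above can be sketched in base R as follows. The data frame `match_toy` and its values are illustrative only, and treating both bounds as inclusive is an assumption, since the study does not state how the limits were applied.

```r
# Toy stand-in for the 'match' data frame (values are illustrative only).
match_toy <- data.frame(
  match_id = 0:5,
  duration = c(2375, 59, 900, 4500, 4501, 16000)
)

# Keep matches lasting 15 to 75 minutes (900 to 4500 seconds), bounds inclusive (assumed).
MatchNewDur <- match_toy[match_toy$duration >= 900 & match_toy$duration <= 4500, ]

removed <- nrow(match_toy) - nrow(MatchNewDur)
print(MatchNewDur)
print(removed)
```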

tower_status & barracks_status

The variables recording each side’s tower status and barracks status at match end have similar distributions. Their differences between the two sides (‘Tower_Diff’ and ‘Barracks_Diff’) show bimodal distributions, which is expected: the winning team’s towers and barracks should have much more health than the losing team’s. There is no reason to modify the dataset based upon these variables. Furthermore, these variables may help to classify each match by win status, such as ‘strong_win’, ‘strong_loss’ and ‘close_game’.

When comparing the difference in towers’ health between the teams with duration we find this interesting plot.

The x-axis represents match duration in seconds and the y-axis represents the difference between the two sides’ total tower health at match end, with positive values meaning the Radiant side’s towers have more health and negative values meaning the Dire side’s towers have more health. The colour shows which side won, from the perspective of the Radiant side.

An interesting observation from this initial plot is that the team whose towers have more health does not necessarily win the match, which is why both wins and losses for each team appear within the +500 to -500 TowerStat_R range. This can be used to group each match into either a dominant win or a close win for either side.
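
A minimal sketch of such a grouping, following this study’s reading of the status columns, is given below. The cut-off `close_limit`, the column name `win_status` and the fourth toy row are assumptions for illustration. The labels are taken from Radiant’s perspective.

```r
# First three rows come from the output above; the fourth is a made-up close game.
toy <- data.frame(
  tower_status_radiant = c(1982, 0, 256, 300),
  tower_status_dire    = c(4, 1846, 1972, 100)
)

# Positive values mean the Radiant towers are in better shape at match end.
toy$Tower_Diff <- toy$tower_status_radiant - toy$tower_status_dire

close_limit <- 500  # hypothetical cut-off based on the +500 to -500 band

toy$win_status <- ifelse(abs(toy$Tower_Diff) <= close_limit, "close_game",
                  ifelse(toy$Tower_Diff > 0, "strong_win", "strong_loss"))
print(toy)
```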

first_blood_time

The variable ‘first_blood_time’ displays a right-skewed distribution, which is expected since it would be rare for a game to go on for long without a single player dying. This variable may be able to help define the win status, as we might expect earlier first-blood kills in strong win/loss scenarios and later ones in closer games.

Other Key Variables

Here are other important variables to include that were not part of the histogram process above.

match_id

The ‘match_id’ variable is an integer that Devin replaced with his own numbering from 0 to 49,999; the same values appear in the ‘players’ data frame and are consistent across all files. As discussed with Devin (see ‘A quick look at Dota 2 dataset’ at https://www.kaggle.com/louisms/discussion), he did this to save space in his files, since the original ‘match_id’ values from the Dota 2 servers are much longer. This variable will be key to linking the ‘match’ and ‘players’ data frames together.

game_mode

The variable ‘game_mode’ was tallied as follows.

## # A tibble: 2 × 2
##   game_mode     n
##       <int> <int>
## 1         2  1316
## 2        22 48362

This shows that 1316 matches were played in game mode ‘2’ and 48362 in game mode ‘22’. It may be worth exploring whether there is any discernible difference between the two modes, and whether that difference is enough to justify excluding the matches played in game mode ‘2’. The answer may depend upon the question being asked: questions such as ‘which hero is most likely to create a win condition’ may be independent of game mode, in which case no matches should be excluded. For now this variable will not be used to exclude any matches.
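
A tally of this kind can be produced in several ways. The sketch below uses base R’s `table()` on a toy vector; it is not necessarily the call used in this study.

```r
# Toy game_mode values (illustrative only).
game_mode <- c(22, 22, 2, 22, 2, 22)

# Count matches per game mode and return the counts as a data frame with column 'n'.
tally <- as.data.frame(table(game_mode), responseName = "n")
print(tally)
```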

radiant_win

The variable ‘radiant_win’ is a factor with the levels ‘True’ and ‘False’, indicating whether the Radiant team won or lost the match. This variable is crucial for identifying who won each match; however, a new variable should be created that records whether each player’s team won or lost. This will allow individual players and heroes to be linked to wins and losses. It will be done after the ‘match’ and ‘players’ data frames are merged, because information from both is required.
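
One way to derive such a per-player outcome is sketched below in base R. The column name `win` and the toy rows are hypothetical, and the side of each player is taken from ‘player_slot’ (0-4 for Radiant, 128-132 for Dire).

```r
# Toy stand-ins for the 'players' and 'match' data frames (illustrative only).
players_toy <- data.frame(
  match_id    = c(0, 0, 1, 1),
  player_slot = c(0, 128, 0, 128)
)
match_toy <- data.frame(
  match_id    = c(0, 1),
  radiant_win = c("True", "False")
)

# Slots 0-4 are Radiant ("R"); slots 128-132 are Dire ("D").
players_toy$team <- ifelse(players_toy$player_slot < 5, "R", "D")

# Attach the match result to every player row via 'match_id'.
merged <- merge(players_toy, match_toy, by = "match_id")

# A player wins when their side matches the winning side.
merged$win <- (merged$team == "R") == (merged$radiant_win == "True")
print(merged)
```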

negative_votes and positive_votes

These variables are not considered to be influential on the win condition of a match, other than each player’s view as to how friendly other players were in the game. These variables will not initially be included in the dataset, but will be remembered in case there is reason to return them to the study.

cluster

The ‘cluster’ variable refers to where in the world these games were hosted and is not relevant to the win condition for each match. This variable will not be included in the dataset.

Cleaned Match Dataset

For this study the following variables from ‘match’ will be used:

1 match_id, 2 start_time, 3 duration, 4 tower_status_radiant, 5 tower_status_dire, 6 barracks_status_dire, 7 barracks_status_radiant, 8 first_blood_time, 9 game_mode, 10 radiant_win, 14 date

# Collect key variables from 'match' dataset
MatchDF <- MatchNewDur %>% select(1 : 10, 14)

players.csv

## Observations: 500,000
## Variables: 73
## $ match_id                          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ account_id                        <int> 0, 1, 0, 2, 3, 4, 0, 5, 0, 6...
## $ hero_id                           <int> 86, 51, 83, 11, 67, 106, 102...
## $ player_slot                       <int> 0, 1, 2, 3, 4, 128, 129, 130...
## $ gold                              <int> 3261, 2954, 110, 1179, 3307,...
## $ gold_spent                        <int> 10960, 17760, 12195, 22505, ...
## $ gold_per_min                      <int> 347, 494, 350, 599, 613, 397...
## $ xp_per_min                        <int> 362, 659, 385, 605, 762, 524...
## $ kills                             <int> 9, 13, 0, 8, 20, 5, 4, 4, 1,...
## $ deaths                            <int> 3, 3, 4, 4, 3, 6, 13, 8, 14,...
## $ assists                           <int> 18, 18, 15, 19, 17, 8, 5, 6,...
## $ denies                            <int> 1, 9, 1, 6, 13, 5, 2, 31, 0,...
## $ last_hits                         <int> 30, 109, 58, 271, 245, 162, ...
## $ stuns                             <fctr> 76.7356, 87.4164, None, Non...
## $ hero_damage                       <int> 8690, 23747, 4217, 14832, 33...
## $ hero_healing                      <int> 218, 0, 1595, 2714, 243, 0, ...
## $ tower_damage                      <int> 143, 423, 399, 6055, 1833, 1...
## $ item_0                            <int> 180, 46, 48, 63, 114, 145, 5...
## $ item_1                            <int> 37, 63, 60, 147, 92, 73, 11,...
## $ item_2                            <int> 73, 119, 59, 154, 147, 149, ...
## $ item_3                            <int> 56, 102, 108, 164, 0, 48, 36...
## $ item_4                            <int> 108, 24, 65, 79, 137, 212, 1...
## $ item_5                            <int> 0, 108, 0, 160, 63, 0, 81, 2...
## $ level                             <int> 16, 22, 17, 21, 24, 19, 16, ...
## $ leaver_status                     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ xp_hero                           <dbl> 8840, 14331, 6692, 8583, 158...
## $ xp_creep                          <dbl> 5440, 8440, 8112, 14230, 143...
## $ xp_roshan                         <dbl> NA, 2683, NA, 894, NA, NA, N...
## $ xp_other                          <dbl> 83, 671, 453, 293, 62, 1, 1,...
## $ gold_other                        <dbl> 50, 395, 259, 100, NA, NA, N...
## $ gold_death                        <dbl> -957, -1137, -1436, -2156, -...
## $ gold_buyback                      <dbl> NA, NA, -1015, NA, -1056, -2...
## $ gold_abandon                      <dbl> NA, NA, NA, NA, NA, NA, NA, ...
## $ gold_sell                         <dbl> 212, 1650, NA, 938, 4194, 20...
## $ gold_destroying_structure         <dbl> 3120, 3299, 3142, 4714, 3217...
## $ gold_killing_heros                <dbl> 5145, 6676, 2418, 4104, 7467...
## $ gold_killing_creeps               <dbl> 1087, 4317, 3697, 10432, 922...
## $ gold_killing_roshan               <dbl> 400, 937, 400, 400, 400, NA,...
## $ gold_killing_couriers             <dbl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_none                   <dbl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_move_to_position       <dbl> 4070, 5894, 7053, 4712, 3853...
## $ unit_order_move_to_target         <dbl> 1, 214, 3, 133, 7, 166, 63, ...
## $ unit_order_attack_move            <dbl> 25, 165, 132, 163, 7, 76, 10...
## $ unit_order_attack_target          <dbl> 416, 1031, 645, 690, 1173, 8...
## $ unit_order_cast_position          <dbl> 51, 98, 36, 9, 31, 196, 13, ...
## $ unit_order_cast_target            <dbl> 144, 39, 160, 15, 84, 3, 173...
## $ unit_order_cast_target_tree       <dbl> 3, 4, 20, 7, 8, 5, 14, 3, 9,...
## $ unit_order_cast_no_target         <dbl> 71, 439, 373, 406, 198, 96, ...
## $ unit_order_cast_toggle            <dbl> NA, NA, NA, NA, NA, 2, NA, N...
## $ unit_order_hold_position          <dbl> 188, 346, 643, 150, 111, 161...
## $ unit_order_train_ability          <dbl> 16, 22, 17, 21, 23, 19, 16, ...
## $ unit_order_drop_item              <dbl> NA, NA, 5, NA, 1, NA, NA, 2,...
## $ unit_order_give_item              <dbl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_pickup_item            <dbl> NA, 12, 7, 1, NA, 2, 1, 18, ...
## $ unit_order_pickup_rune            <dbl> 2, 52, 8, 9, 2, NA, 1, 18, 1...
## $ unit_order_purchase_item          <dbl> 35, 30, 28, 45, 44, 36, 43, ...
## $ unit_order_sell_item              <dbl> 2, 4, NA, 7, 6, 3, 3, 1, NA,...
## $ unit_order_disassemble_item       <dbl> NA, NA, 1, NA, NA, NA, NA, N...
## $ unit_order_move_item              <dbl> 11, 21, 18, 14, 13, 3, 13, 1...
## $ unit_order_cast_toggle_auto       <dbl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_stop                   <dbl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_taunt                  <lgl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_buyback                <dbl> NA, NA, 1, NA, 1, 2, NA, NA,...
## $ unit_order_glyph                  <dbl> NA, NA, NA, 1, 3, NA, 4, NA,...
## $ unit_order_eject_item_from_stash  <dbl> NA, NA, NA, NA, NA, NA, 1, N...
## $ unit_order_cast_rune              <dbl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_ping_ability           <dbl> 6, 14, 17, 13, 23, 2, 1, 4, ...
## $ unit_order_move_to_direction      <dbl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_patrol                 <lgl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_vector_target_position <lgl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_radar                  <lgl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_set_item_combine_lock  <lgl> NA, NA, NA, NA, NA, NA, NA, ...
## $ unit_order_continue               <lgl> NA, NA, NA, NA, NA, NA, NA, ...

Quick Overview of players.csv

Variable                   Description
match_id                   unique integer to identify matches across all files
account_id                 unique integer to identify players; 0 (zero) represents anonymous players
hero_id                    unique integer to identify which hero was selected per player
player_slot                unique integer where 0-4 represent players on the Radiant side and 128-132 the Dire side
gold                       amount of gold a player has at match end
gold_spent                 amount of gold a player spent during the match
gold_per_min               amount of gold per minute a player accumulated on average during the match
xp_per_min                 amount of experience points per minute a player accumulated on average during the match
kills                      total kills a player scored per match
deaths                     number of times a player died per match
assists                    number of assists a player gained per match
denies                     number of denies a player performed per match
last_hits                  number of last hits a player landed on creep per match
stuns                      cumulative amount of time a player stunned enemy players
hero_damage                cumulative amount of damage a player caused to enemy heroes
hero_healing               cumulative amount of healing a player performed on allied heroes
tower_damage               cumulative amount of damage a player caused to enemy towers
item_0 to item_5           integers referring to items a player equipped during the match
level                      the hero character level a player obtained by match end
leaver_status              integer indicating if a player stayed until match end or left early
xp_hero                    experience points gained from killing heroes
xp_creep                   experience points gained from killing creep
xp_roshan                  experience points gained from killing Roshan, neutral creep boss in jungle
xp_other                   experience points gained from other sources
gold_other                 gold gained from other non-conventional sources
gold_death                 cumulative amount of gold lost due to death
gold_buyback               gold spent on buybacks (paying to respawn immediately after death)
gold_abandon               gold abandoned by a player
gold_sell                  gold gained from selling items
gold_destroying_structure  gold gained from destroying structures e.g. tower, barracks, ancient
gold_killing_heros         gold gained from killing enemy heroes
gold_killing_creeps        gold gained from killing creeps
gold_killing_roshan        gold gained from killing Roshan, high level neutral creep boss in jungle
gold_killing_couriers      gold gained from killing enemy couriers

unit_order_

The variables starting with ‘unit_order_’ (columns 40 - 73 in the ‘players’ data frame) count the physical mouse and keyboard clicks a player made to perform certain actions. The usefulness of these variables is not immediately obvious; however, the sum of all clicks per player per match will be collected, since it could serve as a rough indicator of how advanced a player is or how much skill a hero demands to operate effectively.

The histogram of the variable ‘unit_order_total’ is mainly normal in shape, with a small right-skewed tail extending towards 250000. It may be more informative when examined over time, which will require the duration variable from MatchDF.
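
The accumulation and the per-minute rate can be sketched in base R as follows. Only three of the ‘unit_order_’ columns are included, using values for the first two players of match 0, so the totals are illustrative. Treating NA as zero clicks is an assumption, and the column name `orders_per_min` is hypothetical.

```r
# Two players from match 0 with three of the 'unit_order_' columns.
players_toy <- data.frame(
  match_id                    = c(0, 0),
  unit_order_move_to_position = c(4070, 5894),
  unit_order_attack_target    = c(416, 1031),
  unit_order_cast_toggle      = c(NA, NA)
)

# Select every click-count column by its prefix, before adding the total column.
order_cols <- grep("^unit_order_", names(players_toy), value = TRUE)

# Sum clicks per player; NA is treated as zero clicks (assumption).
players_toy$unit_order_total <- unname(rowSums(players_toy[order_cols], na.rm = TRUE))

# Express the total as clicks per minute, using the duration of match 0 (2375 seconds).
duration_seconds <- 2375
players_toy$orders_per_min <- players_toy$unit_order_total / (duration_seconds / 60)
print(players_toy[c("unit_order_total", "orders_per_min")])
```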

Let’s look at the histograms of some of the other key variables.

The ‘players’ data frame contains data from the same 50,000 matches recorded in ‘match’ for each player in each match. Each match has ten players, five per side which is why this data frame has 500,000 observations. As mentioned before the variable ‘match_id’ will be used to link the two main data frames together, ‘match’ and ‘players’.

hero_id

The variable ‘hero_id’ is an integer that, when linked to ‘hero_names.csv’, identifies which hero each player selected prior to each match. This will be key in determining which heroes are best and which should be avoided when first starting out in the game. It is noted that ‘hero_id’ ranges from `r min(players$hero_id)` to `r max(players$hero_id)`; however, hero_id 0 is invalid and indicates that no hero was selected. This means that matches with hero_id 0 were unbalanced from the start and should be removed from the data set.

# identify which matches contain hero_id 0

Tally0 <- PlayersShort %>% 
  group_by(match_id, hero_id) %>% 
  tally() %>% 
  filter(hero_id == 0)

glimpse(Tally0)
## Observations: 35
## Variables: 3
## $ match_id <int> 720, 1032, 1108, 2134, 2773, 7098, 7488, 7582, 7831, ...
## $ hero_id  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ n        <int> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...

‘Tally0’ shows there are 35 matches that contain at least one hero_id 0. These are unbalanced matches, since one side has at most four active players. They will be removed from the ‘MatchDF’ data set.

# identify all match_ids that have hero_id 0 in them and remove from the data set MatchDF
HeroIDrm <- c(Tally0$match_id)
MatchDF <- MatchDF %>% filter(!(match_id %in% HeroIDrm))

Applying this to MatchDF removes only one match, since the duration limits applied previously had already filtered out most of these matches. This is as expected: when teams are unbalanced, the match duration is likely to fall outside the range of a competitive match. There are now 49677 matches in this data set.

Let’s take a more detailed look at the cleaned hero_id variable’s distribution.

This histogram shows considerable variation in how often each hero is selected, which reflects each hero’s popularity within the matches studied here. This will be important when investigating the different win conditions.

gold and experience points (xp) per minute

The variables ‘gold_per_min’, ‘all_gold_gains’, ‘all_gold_loss’ (by death or purchases), ‘xp_per_min’ (experience points per minute) and ‘all_xp_gains’ are general statistics, not necessarily tied to PvP combat, that are believed to influence players’ performances and contribute to win conditions. Their distributions are approximately normal, and all gold and xp variables will be cross-examined to see if they support any models that might help determine how well a team performed towards achieving win conditions.

kills, deaths and assists

The variables ‘kills’ and ‘deaths’ are player-vs-player (PvP) combat statistics collected per player: the number of enemy players killed per match and the number of times the player died at enemy hands per match. The number of assists per match is the number of enemy player deaths each player significantly contributed to without landing the final hit. These variables will help to indicate how well each player performed relative to all the other players in the match. Further investigations could then link each player by their ‘account_id’. These variables show approximately normal distributions and there is no reason to exclude any data based upon them.

denies

The variable ‘denies’ counts how many allied creep below 50% health a player has killed to stop enemy players scoring last hits and gaining xp and gold. This is another variable that could be studied in combination with others to see if it has any influence on the win condition. It shows a right-skewed distribution, which is to be expected: denying enemy players their creep kills is relatively rare, so the counts are biased towards lower values.

last_hits

‘last_hits’ is the number of last hits scored by a player on creep. Last hits are important to track because a player must land the last hit on a creep to gain xp and gold from the kill, which is also why denies work in depriving enemies of xp and gold. The distribution here is closer to normal than the denies distribution, with a slight right skew, as last hits are generally more sought after by players.

hero_damage

The hero damage distribution shows a normal distribution and represents the cumulative amount of damage a player has caused to enemy players. This will be useful for determining hero effectiveness against enemy players.

hero_healing

The hero healing variable represents the amount of healing a player has performed upon allied heroes. The distribution is very narrow and right-skewed, mostly because not many heroes have the ability to heal. This variable will help to indicate the effectiveness of healing support role heroes, whose other counts, such as kills, might be lacking.

tower_damage

The tower damage variable indicates how much damage a player caused to enemy towers during the match. The distribution here is right-skewed as expected. There is limited enemy tower health to destroy and it is unlikely that a single hero from one team would be responsible for all the tower damage in a single match.

level

The level variable indicates what level the hero reached at match end. The distribution is fairly flat above 10. This variable will be helpful when comparing hero levels from winning and losing teams.

Other key variables from ‘players’

account_id

The variable ‘account_id’ is a number Devin assigned to each new unique player recorded in the dataset. Devin applied his own numbering to each account to save space, since the original ‘account_id’ values were much larger. 0 represents a player who is playing anonymously, while the values from 1 to 158360 each represent a unique player. No data will be removed based upon this variable.

A look at how often individual players are recorded in the dataset reveals something interesting.

# tally up the number of times a player is recorded in the dataset
PlayerTally <- PlayersShort %>%
  group_by(account_id) %>%
  tally(sort=TRUE) %>%
  filter(!(account_id==0))

glimpse(PlayerTally)
## Observations: 158,360
## Variables: 2
## $ account_id <int> 2701, 2362, 10307, 18680, 4390, 2962, 2487, 6551, 1...
## $ n          <int> 71, 60, 59, 59, 58, 57, 54, 54, 54, 54, 53, 51, 51,...

Here we see the most recorded player (account_id 2701) has been recorded in only 71 matches. In a dataset of nearly 50,000 matches it may not be possible to draw significant conclusions regarding player behaviour in this dataset.

leaver_status

The ‘leaver_status’ variable ranges from 0 to 4, where 0 indicates that the player stayed until the end of the match and 1-4 indicate varying degrees of leaving during the match. The important point here is to include only matches where no one left. As indicated by the histogram, most of the matches did not have any leavers.

# tally the players recorded as leaving (leaver_status != 0), grouped by match and leaver status
TallyLS <- PlayersShort %>% 
  group_by(match_id, leaver_status) %>% 
  tally() %>% 
  filter(leaver_status != 0)

glimpse(TallyLS)
## Observations: 9,593
## Variables: 3
## $ match_id      <int> 7, 19, 22, 32, 32, 35, 36, 36, 49, 62, 73, 82, 8...
## $ leaver_status <int> 1, 1, 1, 1, 2, 1, 1, 3, 2, 1, 4, 1, 1, 2, 4, 1, ...
## $ n             <int> 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, ...

This shows there are 9593 combinations of match and non-zero leaver status, each covering one or more players (column ‘n’) recorded as leaving the match at some point. These matches should be removed from the dataset because even a single leaver represents a 20% loss of strength for the corresponding side. This gives the team with all players present a significant advantage over the other and will skew all the variables recorded for both teams. It may be interesting at a later date to compare the matches with leavers to those without, to see how much this affects match outcomes.

# identify all match_ids that have leaver_Status > 0 and remove from the data set MatchDF
LSrm <- c(TallyLS$match_id)
MatchDF <- MatchDF %>% filter(!(match_id %in% LSrm))

Applying this filter to MatchDF reduces the number of matches in the data set to 42051 viable matches. This has reduced the dataset to 84% of the original 50000 matches recorded.
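
When reading these counts it helps to separate three quantities: leaving players, rows of the grouped tally, and distinct matches affected. The base R sketch below illustrates the difference on toy data; all values and object names are illustrative.

```r
# Toy player rows (illustrative only): match 32 has two leavers with different
# statuses, match 36 has two leavers with the same status.
players_toy <- data.frame(
  match_id      = c(7, 7, 32, 32, 36, 36, 40),
  leaver_status = c(1, 0, 1, 2, 1, 1, 0)
)

leaver_rows <- players_toy[players_toy$leaver_status != 0, ]

n_leaver_players <- nrow(leaver_rows)                                           # players who left
n_tally_rows     <- nrow(unique(leaver_rows[c("match_id", "leaver_status")]))   # rows of a grouped tally
n_leaver_matches <- length(unique(leaver_rows$match_id))                        # matches to remove

# Share of the original 50000 matches that remain after all filters.
share_remaining <- 42051 / 50000
print(c(n_leaver_players, n_tally_rows, n_leaver_matches))
print(round(share_remaining, 3))
```

Only the number of distinct matches determines how many matches the filter removes.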

player_slot

The variable ‘player_slot’ contains integers ranging from 0-4 and 128-132 for each match. 0-4 represents the 5 players playing for the Radiant side and 128-132 for those 5 players playing for the Dire side. A new variable will be generated to label which side each player is playing for where 0-4 = Radiant or “R” and 128-132 = Dire or “D”.

# add variable "team" whereby 'player_slot' values 0-4 represent Radiant team ("R") and 128-132 represent Dire team ("D").
PlayersTeam <- PlayersShort
PlayersTeam$team <- ifelse(PlayersTeam$player_slot < 5, "R", "D")

Cleaned Players Dataset

For this study the following 27 variables from the original file ‘players’ will be used:

1 match_id, 2 account_id, 3 hero_id, 4 player_slot, 5 gold, 6 gold_spent, 7 gold_per_min, 8 xp_per_min, 9 kills, 10 deaths, 11 assists, 12 denies, 13 last_hits, 15 hero_damage, 16 hero_healing, 17 tower_damage, 24 level, 26 xp_hero, 27 xp_creep, 28 xp_roshan, 31 gold_death, 35 gold_destroying_structure, 36 gold_killing_heros, 37 gold_killing_creeps, 38 gold_killing_roshan, 40 unit_order_total, 41 team

# select key variables from player file
PlayersDF <- PlayersTeam %>% select(1 : 13, 15 : 17, 24, 26 : 28, 31, 35 : 38, 40, 41)

hero_names.csv

## Observations: 112
## Variables: 3
## $ name           <fctr> npc_dota_hero_antimage, npc_dota_hero_axe, npc...
## $ hero_id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ localized_name <fctr> Anti-Mage, Axe, Bane, Bloodseeker, Crystal Mai...

The data frame ‘Hero_Names’ holds the key for the ‘hero_id’ variable in the ‘players’ data frame. This data frame will be tidied up and prepared for applying the key to the ‘hero_id’ variable. The first column ‘name’ holds duplicate information and will be removed. The remaining two columns will be named accordingly: “hero_id” and “Hero_Names”.

Hero_Names["name"] <- NULL
colnames(Hero_Names) <- c("hero_id", "Hero_Names")

Further examination of the ‘hero_names’ data frame shows that there is no hero_id 24 in the data set and the most recent hero ‘Monkey King’ is not present since this data set is taken from before Monkey King was introduced to the game.
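
A gap such as the missing hero_id 24 can be found by comparing the ids present against the full integer range. The sketch below uses a shortened toy vector of ids rather than the real file.

```r
# Toy hero ids with a gap at 24 (shortened; the real file runs further).
hero_id <- c(1:23, 25:30)

# Ids in 1..max(hero_id) that never appear in the data.
missing_ids <- setdiff(seq_len(max(hero_id)), hero_id)
print(missing_ids)
```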

The hero_names.csv file was updated to include each hero’s roles and class as defined by dota2.gamepedia.com/Role. Each hero can have multiple roles but only one of three classes. The three classes and their descriptions are:

Hero_Class          Class_Descriptions
Strength (STR)      warrior-like class mainly dealing melee damage
Agility (AGI)       agile class, fast and harder to hit
Intelligence (INT)  spell-wielding class

The hero roles and their descriptions are:

Roles      Gamepedia_Descriptions
Carry      Will become more useful later in the game if they gain a significant gold advantage
Disabler   Has a guaranteed disable for one or more of their spells
Initiator  Good at starting a teamfight
Jungler    Can farm effectively from neutral creeps inside the jungle early in the game
Support    Can focus less on amassing gold and items, and more on using their abilities to gain an advantage for the team
Durable    Has the ability to last longer in teamfights
Nuker      Can quickly kill enemy heroes using high damage spells with low cooldowns
Pusher     Can quickly siege and destroy towers and barracks at all points of the game
Escape     Has the ability to quickly avoid death

For more details on roles please see dota2.gamepedia.com/Role. The data frame Hero_Names was exported as a csv file and modified in Excel to include the class and role information. It now looks like this.

## Observations: 112
## Variables: 14
## $ hero_id       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ Hero_Names    <fctr> Anti-Mage, Axe, Bane, Bloodseeker, Crystal Maid...
## $ All_Roles     <chr> "Car      Nuk  Esc", " Dis Ini Jun  Dur   ", " D...
## $ Carry_Car     <fctr> Car, NA, NA, Car, NA, Car, NA, Car, Car, Car, C...
## $ Disabler_Dis  <fctr> NA, Dis, Dis, Dis, Dis, Dis, Dis, NA, Dis, Dis,...
## $ Initiator_Ini <fctr> NA, Ini, NA, Ini, NA, NA, Ini, NA, NA, NA, NA, ...
## $ Jungler_Jun   <fctr> NA, Jun, NA, Jun, Jun, NA, NA, NA, NA, NA, NA, ...
## $ Support_Sup   <fctr> NA, NA, Sup, NA, Sup, NA, Sup, NA, Sup, NA, NA,...
## $ Durable_Dur   <fctr> NA, Dur, Dur, NA, NA, NA, NA, NA, NA, Dur, NA, ...
## $ Nuker_Nuk     <fctr> Nuk, NA, Nuk, Nuk, Nuk, NA, Nuk, NA, Nuk, Nuk, ...
## $ Pusher_Pus    <fctr> NA, NA, NA, NA, NA, Pus, NA, Pus, NA, NA, NA, P...
## $ Escape_Esc    <fctr> Esc, NA, NA, NA, NA, NA, NA, Esc, Esc, Esc, NA,...
## $ Role_Count    <int> 3, 4, 4, 5, 4, 3, 4, 3, 5, 5, 2, 4, 4, 4, 4, 5, ...
## $ Class         <fctr> AGI, STR, INT, AGI, INT, AGI, STR, AGI, AGI, AG...
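
The role columns above were filled in by hand in Excel. As a rough illustration only, the same layout could be derived in R from a single comma-separated role string per hero. The data frame `heroes` and its `roles` column are hypothetical stand-ins, not part of the original files; the two rows reuse the roles shown above for Anti-Mage and Axe.

```r
# Hypothetical input: one comma-separated role string per hero.
# The 'roles' column is an assumption made for this sketch.
heroes <- data.frame(
  hero_id    = 1:2,
  Hero_Names = c("Anti-Mage", "Axe"),
  roles      = c("Carry,Nuker,Escape", "Disabler,Initiator,Jungler,Durable"),
  stringsAsFactors = FALSE
)

role_names <- c("Carry", "Disabler", "Initiator", "Jungler", "Support",
                "Durable", "Nuker", "Pusher", "Escape")
role_abbr  <- substr(role_names, 1, 3)            # "Car", "Dis", "Ini", ...
role_cols  <- paste(role_names, role_abbr, sep = "_")

# One column per role: the abbreviation if the hero has the role, NA otherwise
for (i in seq_along(role_names)) {
  has_role <- grepl(role_names[i], heroes$roles, fixed = TRUE)
  heroes[[role_cols[i]]] <- ifelse(has_role, role_abbr[i], NA_character_)
}

# Role_Count is the number of non-missing role indicators per hero
heroes$Role_Count <- as.integer(rowSums(!is.na(heroes[role_cols])))
```

Deriving the columns in code keeps the step repeatable if the role definitions change, whereas a spreadsheet edit has to be redone by hand.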

Each hero is now linked to its single class and its set of roles, as defined by the community at dota2.gamepedia.com/Role. These variables will be joined to the PlayersDF data frame.

# combine PlayersDF and Hero_Names by the variable 'hero_id'
PlayersDF <- left_join(PlayersDF, Hero_Names, by = "hero_id")
glimpse(PlayersDF)
## Observations: 500,000
## Variables: 40
## $ match_id                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ account_id                <int> 0, 1, 0, 2, 3, 4, 0, 5, 0, 6, 0, 7, ...
## $ hero_id                   <int> 86, 51, 83, 11, 67, 106, 102, 46, 7,...
## $ player_slot               <int> 0, 1, 2, 3, 4, 128, 129, 130, 131, 1...
## $ gold                      <int> 3261, 2954, 110, 1179, 3307, 476, 31...
## $ gold_spent                <int> 10960, 17760, 12195, 22505, 23825, 1...
## $ gold_per_min              <int> 347, 494, 350, 599, 613, 397, 303, 4...
## $ xp_per_min                <int> 362, 659, 385, 605, 762, 524, 369, 5...
## $ kills                     <int> 9, 13, 0, 8, 20, 5, 4, 4, 1, 1, 3, 9...
## $ deaths                    <int> 3, 3, 4, 4, 3, 6, 13, 8, 14, 11, 4, ...
## $ assists                   <int> 18, 18, 15, 19, 17, 8, 5, 6, 8, 6, 9...
## $ denies                    <int> 1, 9, 1, 6, 13, 5, 2, 31, 0, 0, 0, 9...
## $ last_hits                 <int> 30, 109, 58, 271, 245, 162, 107, 208...
## $ hero_damage               <int> 8690, 23747, 4217, 14832, 33740, 107...
## $ hero_healing              <int> 218, 0, 1595, 2714, 243, 0, 764, 0, ...
## $ tower_damage              <int> 143, 423, 399, 6055, 1833, 112, 0, 2...
## $ level                     <int> 16, 22, 17, 21, 24, 19, 16, 19, 12, ...
## $ xp_hero                   <dbl> 8840, 14331, 6692, 8583, 15814, 8502...
## $ xp_creep                  <dbl> 5440, 8440, 8112, 14230, 14325, 1225...
## $ xp_roshan                 <dbl> NA, 2683, NA, 894, NA, NA, NA, NA, N...
## $ gold_death                <dbl> -957, -1137, -1436, -2156, -1437, -2...
## $ gold_destroying_structure <dbl> 3120, 3299, 3142, 4714, 3217, 320, 3...
## $ gold_killing_heros        <dbl> 5145, 6676, 2418, 4104, 7467, 5281, ...
## $ gold_killing_creeps       <dbl> 1087, 4317, 3697, 10432, 9220, 6193,...
## $ gold_killing_roshan       <dbl> 400, 937, 400, 400, 400, NA, NA, NA,...
## $ unit_order_total          <dbl> 5041, 8385, 9167, 6396, 5588, 8197, ...
## $ team                      <chr> "R", "R", "R", "R", "R", "D", "D", "...
## $ Hero_Names                <fctr> Rubick, Clockwerk, Treant Protector...
## $ All_Roles                 <chr> " Dis   Sup  Nuk  ", " Dis Ini   Dur...
## $ Carry_Car                 <fctr> NA, NA, NA, Car, Car, Car, Car, Car...
## $ Disabler_Dis              <fctr> Dis, Dis, Dis, NA, NA, Dis, NA, NA,...
## $ Initiator_Ini             <fctr> NA, Ini, Ini, NA, NA, Ini, NA, NA, ...
## $ Jungler_Jun               <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ Support_Sup               <fctr> Sup, NA, Sup, NA, NA, NA, Sup, NA, ...
## $ Durable_Dur               <fctr> NA, Dur, Dur, NA, Dur, Dur, Dur, NA...
## $ Nuker_Nuk                 <fctr> Nuk, Nuk, NA, Nuk, NA, Nuk, NA, NA,...
## $ Pusher_Pus                <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ Escape_Esc                <fctr> NA, NA, Esc, NA, Esc, Esc, NA, Esc,...
## $ Role_Count                <int> 3, 4, 5, 2, 3, 6, 3, 2, 4, 6, 4, 6, ...
## $ Class                     <fctr> INT, STR, STR, AGI, AGI, AGI, STR, ...
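
A left join keeps every player row, so any hero_id without an entry in the lookup table ends up with NA in the hero columns rather than being dropped. The sketch below shows one way such a check might look. It uses small toy tables and base R's merge() as a dependency-free stand-in for left_join(); the hero_id value 0 is a hypothetical unmatched key added for illustration.

```r
# Toy stand-ins for PlayersDF and Hero_Names; hero_id 0 is a made-up unmatched key
players <- data.frame(match_id = c(0, 0, 1), hero_id = c(86, 51, 0))
heroes  <- data.frame(hero_id = c(51, 86),
                      Hero_Names = c("Clockwerk", "Rubick"),
                      stringsAsFactors = FALSE)

# hero_id values present in the player rows but missing from the lookup table
unmatched_ids <- setdiff(players$hero_id, heroes$hero_id)

# Duplicated keys in the lookup table would multiply rows in the join
duplicated_keys <- heroes$hero_id[duplicated(heroes$hero_id)]

# Base-R equivalent of a left join (note that merge() sorts rows by the key)
joined <- merge(players, heroes, by = "hero_id", all.x = TRUE)

# The row count should be unchanged; unmatched rows carry NA hero names
rows_preserved  <- nrow(joined) == nrow(players)
rows_without_hero <- sum(is.na(joined$Hero_Names))
```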

CombinedDF Data Frame

The ‘MatchDF’ and ‘PlayersDF’ data frames have both been cleaned and now contain only the data pertinent to this project. They will be joined on their common ‘match_id’ variable. A new variable will then be added for each player in each match to record their win / loss status, derived from the ‘radiant_win’ and ‘team’ variables.

# Combine PlayersDF and MatchDF into CombinedDF
CombinedDF <- left_join(MatchDF, PlayersDF, by = "match_id")

# add column 'WL' to CombinedDF to indicate Win/Loss for each individual player
CombinedDF$WL <- ifelse(CombinedDF$radiant_win == "True" & CombinedDF$team == "R", "Win", 
                  ifelse(CombinedDF$radiant_win == "True" & CombinedDF$team == "D", "Loss", 
                  ifelse(CombinedDF$radiant_win == "False" & CombinedDF$team == "R", "Loss", "Win")))

# add column 'Win' to CombinedDF to indicate win/loss by 1 and 0
CombinedDF$Win <- ifelse(CombinedDF$WL == "Win", 1L, 0L)

# rename column 'Hero_Names' to 'Name'
CombinedDF <- CombinedDF %>% rename(Name = Hero_Names)

glimpse(CombinedDF)
## Observations: 420,510
## Variables: 52
## $ match_id                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ start_time                <int> 1446750112, 1446750112, 1446750112, ...
## $ duration                  <int> 2375, 2375, 2375, 2375, 2375, 2375, ...
## $ tower_status_radiant      <int> 1982, 1982, 1982, 1982, 1982, 1982, ...
## $ tower_status_dire         <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1846, ...
## $ barracks_status_dire      <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 63, 63...
## $ barracks_status_radiant   <int> 63, 63, 63, 63, 63, 63, 63, 63, 63, ...
## $ first_blood_time          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 221, 2...
## $ game_mode                 <int> 22, 22, 22, 22, 22, 22, 22, 22, 22, ...
## $ radiant_win               <fctr> True, True, True, True, True, True,...
## $ date                      <dttm> 2015-11-05 19:01:52, 2015-11-05 19:...
## $ account_id                <int> 0, 1, 0, 2, 3, 4, 0, 5, 0, 6, 0, 7, ...
## $ hero_id                   <int> 86, 51, 83, 11, 67, 106, 102, 46, 7,...
## $ player_slot               <int> 0, 1, 2, 3, 4, 128, 129, 130, 131, 1...
## $ gold                      <int> 3261, 2954, 110, 1179, 3307, 476, 31...
## $ gold_spent                <int> 10960, 17760, 12195, 22505, 23825, 1...
## $ gold_per_min              <int> 347, 494, 350, 599, 613, 397, 303, 4...
## $ xp_per_min                <int> 362, 659, 385, 605, 762, 524, 369, 5...
## $ kills                     <int> 9, 13, 0, 8, 20, 5, 4, 4, 1, 1, 3, 9...
## $ deaths                    <int> 3, 3, 4, 4, 3, 6, 13, 8, 14, 11, 4, ...
## $ assists                   <int> 18, 18, 15, 19, 17, 8, 5, 6, 8, 6, 9...
## $ denies                    <int> 1, 9, 1, 6, 13, 5, 2, 31, 0, 0, 0, 9...
## $ last_hits                 <int> 30, 109, 58, 271, 245, 162, 107, 208...
## $ hero_damage               <int> 8690, 23747, 4217, 14832, 33740, 107...
## $ hero_healing              <int> 218, 0, 1595, 2714, 243, 0, 764, 0, ...
## $ tower_damage              <int> 143, 423, 399, 6055, 1833, 112, 0, 2...
## $ level                     <int> 16, 22, 17, 21, 24, 19, 16, 19, 12, ...
## $ xp_hero                   <dbl> 8840, 14331, 6692, 8583, 15814, 8502...
## $ xp_creep                  <dbl> 5440, 8440, 8112, 14230, 14325, 1225...
## $ xp_roshan                 <dbl> NA, 2683, NA, 894, NA, NA, NA, NA, N...
## $ gold_death                <dbl> -957, -1137, -1436, -2156, -1437, -2...
## $ gold_destroying_structure <dbl> 3120, 3299, 3142, 4714, 3217, 320, 3...
## $ gold_killing_heros        <dbl> 5145, 6676, 2418, 4104, 7467, 5281, ...
## $ gold_killing_creeps       <dbl> 1087, 4317, 3697, 10432, 9220, 6193,...
## $ gold_killing_roshan       <dbl> 400, 937, 400, 400, 400, NA, NA, NA,...
## $ unit_order_total          <dbl> 5041, 8385, 9167, 6396, 5588, 8197, ...
## $ team                      <chr> "R", "R", "R", "R", "R", "D", "D", "...
## $ Name                      <fctr> Rubick, Clockwerk, Treant Protector...
## $ All_Roles                 <chr> " Dis   Sup  Nuk  ", " Dis Ini   Dur...
## $ Carry_Car                 <fctr> NA, NA, NA, Car, Car, Car, Car, Car...
## $ Disabler_Dis              <fctr> Dis, Dis, Dis, NA, NA, Dis, NA, NA,...
## $ Initiator_Ini             <fctr> NA, Ini, Ini, NA, NA, Ini, NA, NA, ...
## $ Jungler_Jun               <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ Support_Sup               <fctr> Sup, NA, Sup, NA, NA, NA, Sup, NA, ...
## $ Durable_Dur               <fctr> NA, Dur, Dur, NA, Dur, Dur, Dur, NA...
## $ Nuker_Nuk                 <fctr> Nuk, Nuk, NA, Nuk, NA, Nuk, NA, NA,...
## $ Pusher_Pus                <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ Escape_Esc                <fctr> NA, NA, Esc, NA, Esc, Esc, NA, Esc,...
## $ Role_Count                <int> 3, 4, 5, 2, 3, 6, 3, 2, 4, 6, 4, 6, ...
## $ Class                     <fctr> INT, STR, STR, AGI, AGI, AGI, STR, ...
## $ WL                        <chr> "Win", "Win", "Win", "Win", "Win", "...
## $ Win                       <int> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
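
The win / loss rule reduces to one comparison: a player wins when their team is the side that won the match. The sketch below applies that compact form to a toy ten-player match and checks that it yields five wins. The data frame `toy` is illustrative only, and the compact expression is an alternative formulation, not the code used above.

```r
# Toy rows in the shape of CombinedDF: one match of ten players, Radiant won
toy <- data.frame(
  match_id    = rep(0L, 10),
  radiant_win = factor(rep("True", 10), levels = c("False", "True")),
  team        = rep(c("R", "D"), each = 5),
  stringsAsFactors = FALSE
)

# A player wins when "Radiant won" and "player is on Radiant" agree
radiant_won <- toy$radiant_win == "True"
toy$WL  <- ifelse(radiant_won == (toy$team == "R"), "Win", "Loss")
toy$Win <- as.integer(toy$WL == "Win")

# In a complete ten-player match, exactly five players should be winners
wins_per_match <- tapply(toy$Win, toy$match_id, sum)
```

Running the same tapply() call on the full data frame would flag any match whose win total differs from five, which usually points to missing player rows or an unexpected value in ‘radiant_win’ or ‘team’.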

The ‘CombinedDF’ data frame has 420,510 observations across 52 variables. It brings together all of the data discussed in this report and is the source for every analysis that follows in this project.

# create 'CombinedDF.csv' file for further use in future Rmd documents.
write.csv(CombinedDF, "CombinedDF.csv")
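
One detail to keep in mind when reading this file back in: write.csv() writes row names by default, so the re-imported data frame gains an extra first column. The sketch below demonstrates this on a small toy data frame written to a temporary file; `df` and `path` are illustrative names, not objects from this project.

```r
# Toy data frame and a temporary file, used only for illustration
df   <- data.frame(match_id = 0:1, Win = c(1L, 0L))
path <- tempfile(fileext = ".csv")

# Default call: row names are written as an unnamed first column
write.csv(df, path)
with_row_names <- read.csv(path)
names(with_row_names)      # "X" "match_id" "Win"

# Suppressing row names keeps the column set unchanged on re-import
write.csv(df, path, row.names = FALSE)
without_row_names <- read.csv(path)
names(without_row_names)   # "match_id" "Win"
```

Either form works as long as the later documents account for it; the extra column can simply be dropped after import if the default call is kept.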