The Dataset and Initial EDA

The Dataset & Initial Exploratory Data Analysis

The dataset used here was posted to Kaggle by user Devin (www.kaggle.com/devinanzelmo/dota-2-matches) and contains a wide range of variables set out across several CSV files. Devin obtained the data from opendota.com, a data-mining website that collects match data from the Dota 2 game servers.

This study will mainly focus on the analysis of two of these files:

match.csv - contains 50,000 observations over 13 variables
players.csv - contains 500,000 observations over 73 variables

Other useful files that will support this work include:

hero_names.csv - originally contained 112 observations over 3 variables. More variables will be added to this file based on the roles and classes defined at http://dota2.gamepedia.com/Role.

match.csv

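library(dplyr)  # provides glimpse() and the %>% pipe used throughout
# stringsAsFactors = TRUE keeps radiant_win as a factor on R >= 4.0, matching the output below
match <- read.csv("match.csv", stringsAsFactors = TRUE)
glimpse(match)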

## Observations: 50,000
## Variables: 13
## $ match_id                <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
## $ start_time              <int> 1446750112, 1446753078, 1446764586, 14...
## $ duration                <int> 2375, 2582, 2716, 3085, 1887, 1574, 21...
## $ tower_status_radiant    <int> 1982, 0, 256, 4, 2047, 2047, 1972, 204...
## $ tower_status_dire       <int> 4, 1846, 1972, 1924, 0, 4, 0, 0, 1982,...
## $ barracks_status_dire    <int> 3, 63, 63, 51, 0, 3, 3, 0, 63, 63, 51,...
## $ barracks_status_radiant <int> 63, 0, 48, 3, 63, 63, 63, 63, 0, 0, 63...
## $ first_blood_time        <int> 1, 221, 190, 40, 58, 113, 4, 255, 4, 8...
## $ game_mode               <int> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22...
## $ radiant_win             <fctr> True, False, False, False, True, True...
## $ negative_votes          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ positive_votes          <int> 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ cluster                 <int> 155, 154, 132, 191, 156, 155, 151, 138...

Quick Overview of match.csv

Variable                  Description
match_id                  unique integer identifying a match across all files
start_time                match start time in seconds since 00:00, 01/01/1970, UTC
duration                  length of the match in seconds
tower_status_radiant      total health of all Radiant towers at match end
tower_status_dire         total health of all Dire towers at match end
barracks_status_dire      total health of all Dire barracks at match end
barracks_status_radiant   total health of all Radiant barracks at match end
first_blood_time          time in seconds from match start until the first player is killed
game_mode                 integer indicating which game mode the match was played in
radiant_win               factor indicating whether the Radiant team won or lost the match
negative_votes            number of negative votes the match received
positive_votes            number of positive votes the match received
cluster                   integer indicating the region of the world in which the match was played

Let’s look at histograms of some of these variables. The variables ‘match_id’, ‘game_mode’, ‘radiant_win’, ‘negative_votes’, ‘positive_votes’ and ‘cluster’ will not be examined by histogram.

library(ggplot2)   # plotting
library(reshape2)  # melt() reshapes M to long format for faceting

# select variables to be examined by histogram
M <- match[c("start_time", "duration", "tower_status_radiant", "tower_status_dire", "barracks_status_dire", "barracks_status_radiant", "first_blood_time")]

# include the difference between tower and barracks health from the Radiant perspective:
# positive values indicate Radiant had more health, negative values indicate Dire had more
M$Tower_Diff <- match$tower_status_radiant - match$tower_status_dire
M$Barracks_Diff <- match$barracks_status_radiant - match$barracks_status_dire

ggplot(data = melt(M), aes(x = value)) + 
  geom_histogram(bins = 20) + 
  facet_wrap(~variable, scales = "free_x")

## No id variables; using all as measure variables

A quick look at these histograms shows some interesting features.

start_time

‘start_time’ records when each match started, in seconds since 00:00, 01/01/1970, UTC. The bulk of the recorded matches fall in the second half of the ‘start_time’ range. When asked why this might be, Devin said he believes it is due to the way Opendota.com sampled the Dota 2 game servers (please see https://www.kaggle.com/louisms/discussion). The change in frequency marks the moment when Opendota.com improved its sampling to capture more games. This discrepancy does not affect how each recorded game was played, so no matches will be excluded because of it.

# Convert start time from seconds since the Unix epoch to a POSIXct date-time (UTC)
MatchNewST <- match %>% mutate(date = as.POSIXct(start_time, tz = "UTC", origin = "1970-01-01"))
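
One way to see the change in sampling rate is to count matches per calendar day once the start times have been converted. The sketch below uses base R on a handful of timestamps; the helper name matches_per_day is hypothetical and not part of the original analysis, and the third timestamp is made up for illustration.

```r
# Hypothetical helper: count matches per UTC calendar day.
matches_per_day <- function(start_time) {
  # start_time is in seconds since 1970-01-01 00:00 UTC
  stamp <- as.POSIXct(start_time, origin = "1970-01-01", tz = "UTC")
  table(as.Date(stamp, tz = "UTC"))
}

# The first two timestamps come from the glimpse() output above;
# the third is a made-up value that falls on the following day.
toy_start <- c(1446750112, 1446753078, 1446850000)
daily <- matches_per_day(toy_start)
print(daily)
```

Applied to the full ‘start_time’ column, a jump in the daily counts would mark the point at which the sampling changed.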

duration

‘Duration’ seems to show a fairly normal distribution between roughly 1000 and 5000 seconds. However, some matches run longer than 16000 seconds; these are most likely outliers representing games that were not played at a competitive level. The shortest game is only 59 seconds long, which is clearly not a competitive game either. Let’s take a closer look at the distribution.

# fill is a fixed colour, so it is set outside aes() rather than mapped as a variable
DurHist <- ggplot(MatchNewST, aes(x = duration)) + 
  geom_histogram(bins = 1000, fill = "red") +
  scale_x_continuous(breaks = seq(0, 16000, by = 1000))
DurHist

It is important for this study to keep only the matches that were most likely played competitively, and duration is a key indicator of this. Research into average game lengths suggests general agreement that matches last 35-45 minutes on average (2100 - 2700 seconds), with matches of 25 minutes (1500 seconds) considered ‘short’ and matches of 60 minutes (3600 seconds) considered ‘long’. This study will therefore only look at matches that lasted between 15 and 75 minutes (900 - 4500 seconds).

# limit the dataset to matches that lasted between 15 and 75 minutes
MatchNewDur <- subset(MatchNewST, duration >= 900 & duration <= 4500)
glimpse(MatchNewDur)

## Observations: 49,678
## Variables: 14
## $ match_id                <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...
## $ start_time              <int> 1446750112, 1446753078, 1446764586, 14...
## $ duration                <int> 2375, 2582, 2716, 3085, 1887, 1574, 21...
## $ tower_status_radiant    <int> 1982, 0, 256, 4, 2047, 2047, 1972, 204...
## $ tower_status_dire       <int> 4, 1846, 1972, 1924, 0, 4, 0, 0, 1982,...
## $ barracks_status_dire    <int> 3, 63, 63, 51, 0, 3, 3, 0, 63, 63, 51,...
## $ barracks_status_radiant <int> 63, 0, 48, 3, 63, 63, 63, 63, 0, 0, 63...
## $ first_blood_time        <int> 1, 221, 190, 40, 58, 113, 4, 255, 4, 8...
## $ game_mode               <int> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22...
## $ radiant_win             <fctr> True, False, False, False, True, True...
## $ negative_votes          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ positive_votes          <int> 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ cluster                 <int> 155, 154, 132, 191, 156, 155, 151, 138...
## $ date                    <dttm> 2015-11-05 19:01:52, 2015-11-05 19:51...

This removed 322 matches, leaving 49,678 matches in the study with the following distribution of ‘duration’.
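
The filter and its bookkeeping (50,000 - 49,678 = 322 matches removed) can be wrapped in a small function that converts the minute limits to seconds and reports how many rows were dropped. This is a minimal base R sketch; filter_duration is a hypothetical helper, and the toy durations are illustrative values chosen to sit on and around the limits.

```r
# Hypothetical helper: keep matches whose duration lies within
# [min_minutes, max_minutes], both limits inclusive.
filter_duration <- function(df, min_minutes = 15, max_minutes = 75) {
  lower <- min_minutes * 60   # 15 minutes -> 900 seconds
  upper <- max_minutes * 60   # 75 minutes -> 4500 seconds
  kept <- df[df$duration >= lower & df$duration <= upper, , drop = FALSE]
  message(nrow(df) - nrow(kept), " matches removed, ", nrow(kept), " kept")
  kept
}

# Toy data: durations just outside, on, and inside the limits
toy <- data.frame(match_id = 0:5,
                  duration = c(59, 900, 2375, 4500, 4501, 16037))
kept <- filter_duration(toy)
```

Both limits are inclusive, matching the `>=` and `<=` used in the subset() call above.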

# fill is a fixed colour, so it is set outside aes() rather than mapped as a variable
DurHistNew <- ggplot(MatchNewDur, aes(x = duration)) + 
  geom_histogram(bins = 1000, fill = "red") +
  scale_x_continuous(breaks = seq(0, 5000, by = 500))
DurHistNew

tower_status & barracks_status

The variables recording each side’s tower status and barracks status at match end have similar distributions. Their differences (‘Tower_Diff’ and ‘Barracks_Diff’) show bimodal distributions, which is expected: the winning team’s towers and barracks should have much more health than their enemy’s. There is no reason to modify the dataset based on these variables. Furthermore, these variables may be able to assign each match a win status such as ‘strong_win’, ‘strong_loss’ or ‘close_game’.
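
One point worth checking before relying on these differences: in the values shown in the glimpse() output, the tower status never exceeds 2047 and the barracks status never exceeds 63, which is what an 11-bit and a 6-bit mask would produce. The Dota 2 Web API describes these status fields as bitmasks in which each set bit marks a building that is still standing. If that encoding applies to this dataset, the number of set bits would be a more meaningful quantity to compare than the raw integers. A minimal base R sketch under that assumption, with count_standing as a hypothetical helper:

```r
# Count the set bits in each status value, assuming a bitmask encoding.
# This encoding is an assumption about the data, not something the
# original analysis states.
count_standing <- function(status, n_bits) {
  masks <- as.integer(2^(seq_len(n_bits) - 1))   # 1, 2, 4, ...
  vapply(status,
         function(s) sum(bitwAnd(as.integer(s), masks) != 0L),
         integer(1))
}

# Status values taken from the glimpse() output above
towers   <- count_standing(c(2047, 1982, 4, 0), n_bits = 11)
barracks <- count_standing(c(63, 48, 3, 0), n_bits = 6)
print(towers)
print(barracks)
```

Under this reading, 2047 would mean all 11 towers standing and 0 would mean none.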

Comparing the difference in tower health between the teams against duration gives this interesting plot.

# TowerStat_R is the health difference between Radiant and Dire towers at the end of each game.
TowerStat_R <- MatchNewDur$tower_status_radiant - MatchNewDur$tower_status_dire

# data frame containing match_duration, TowerStat_R and radiant_win status
TvTowerStatus <- data.frame(MatchNewDur$duration, TowerStat_R, MatchNewDur$radiant_win)
colnames(TvTowerStatus) <- c("Duration", "TowerStat_R", "Radiant_Win")

# GGplot to compare match durations against difference between Radiant and Dire towers health coloured by radiant_win variable.

G1 <- ggplot(TvTowerStatus, aes(x = Duration, y = TowerStat_R, colour = Radiant_Win)) + geom_point()
print(G1)

The x-axis represents match duration in seconds and the y-axis represents the difference between each side’s total tower health at match end, with positive values meaning the Radiant side’s towers had more health and negative values meaning the Dire side’s towers had more health. The colour shows whether the Radiant side won.

An interesting observation from this initial plot is that the team whose towers have more health does not necessarily win the match: within the -500 to +500 TowerStat_R range, both teams win and lose matches. This can be used to group matches into a dominant win for either side or a close win.
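
A grouping of this kind could be sketched as follows. The function classify_match is hypothetical, the labels are from the Radiant perspective, and the threshold of 500 is simply the range mentioned above, used as an illustrative cut-off, not a tuned value.

```r
# Hypothetical classifier: label each match from the Radiant perspective.
# Differences within +/- threshold count as a close game.
classify_match <- function(tower_diff, threshold = 500) {
  status <- ifelse(abs(tower_diff) <= threshold, "close_game",
                   ifelse(tower_diff > 0, "strong_win", "strong_loss"))
  factor(status, levels = c("strong_loss", "close_game", "strong_win"))
}

# The first two differences come from the first two matches shown earlier
# (1982 - 4 and 0 - 1846); the last two are made-up values near the threshold.
toy_diff <- c(1978, -1846, 120, -500)
win_status <- classify_match(toy_diff)
print(table(win_status))
```

Whether a close game is better defined by tower difference alone, or together with barracks difference and ‘radiant_win’, is left open here.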

first_blood_time

The variable ‘first_blood_time’ displays a right-skewed distribution, which is expected since it would be rare for a game to go on for long without a single player dying. This variable may be able to help define the win status, as we might expect earlier first-blood kills in strong win/loss scenarios and later ones in closer games.

Other Key Variables

Here are other important variables to include that were not part of the histogram process above.

match_id

The ‘match_id’ variable is an integer that Devin replaced with his own numbering from 0 to 49,999; the same values appear in the ‘players’ data frame and are consistent across all files. As discussed with Devin (see ‘A quick look at Dota 2 dataset’ at https://www.kaggle.com/louisms/discussion), he did this to save space in his files, since the original ‘match_id’ values from the Dota 2 servers are much longer. This variable will be key to linking the ‘match’ and ‘players’ data frames together.
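
The link between the two data frames can be sketched with base R’s merge(). The toy frames below stand in for ‘match’ and ‘players’; the hero_id column and its values are assumptions used only for illustration.

```r
# Toy stand-ins for the 'match' and 'players' data frames
match_toy <- data.frame(match_id = 0:1,
                        radiant_win = c("True", "False"))
players_toy <- data.frame(match_id = c(0, 0, 1, 1),
                          hero_id = c(86, 51, 7, 82))   # made-up hero ids

# Each player row picks up the match-level columns for its match_id
joined <- merge(players_toy, match_toy, by = "match_id")
print(joined)
```

Because each match has several player rows, the result has one row per player, with the match-level columns repeated.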

game_mode

The variable ‘game_mode’ was investigated as follows.

GMTally <- MatchNewDur %>% group_by(game_mode) %>% tally()
GMTally

## # A tibble: 2 × 2
##   game_mode     n
##       <int> <int>
## 1         2  1316
## 2        22 48362

This shows that 1,316 matches were played as game mode ‘2’ and 48,362 as game mode ‘22’. It may be worth exploring whether there is any discernible difference between the two modes, and whether that difference is enough to justify excluding the matches played under game mode ‘2’. The answer may depend on the question being asked: questions such as ‘which hero is most likely to create a win condition’ may be independent of game mode, in which case no matches should be excluded on this basis.
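
As a starting point for that comparison, the share of each mode follows directly from the tally, and a per-mode summary such as the Radiant win rate can be computed with tapply(). This is a base R sketch; radiant_rate_by_mode is a hypothetical helper and the toy data frame is made up, so its rates say nothing about the real data.

```r
# Share of each game mode, from the tally above
mode_counts <- c("2" = 1316, "22" = 48362)
mode_share <- round(mode_counts / sum(mode_counts), 4)
print(mode_share)

# Hypothetical helper: proportion of Radiant wins within each game mode
radiant_rate_by_mode <- function(df) {
  tapply(df$radiant_win == "True", df$game_mode, mean)
}

# Made-up rows for illustration only
toy <- data.frame(game_mode = c(2, 2, 22, 22, 22, 22),
                  radiant_win = c("True", "False", "True", "True", "False", "True"))
rates <- radiant_rate_by_mode(toy)
print(rates)
```

Game mode ‘2’ accounts for under 3% of the retained matches, so any per-mode comparison should keep the much smaller sample size in mind.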

radiant_win

The variable ‘radiant_win’ is a factor with the levels ‘True’ and ‘False’, indicating whether the Radiant team won or lost the match. This variable is crucial for identifying who won each match; however, a new variable should be created that identifies whether each player’s team won or lost. This will allow individual players and heroes to be linked to wins and losses. It will be created after the ‘match’ and ‘players’ data frames are merged, because information from both is required.
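
A per-player win variable could be derived along these lines once the frames are merged. The sketch assumes that ‘players’ carries a player_slot column in which values below 128 denote Radiant players and values of 128 and above denote Dire players; that encoding, the add_player_win helper and the toy rows are all assumptions to be checked against the real file.

```r
# Hypothetical helper: a player wins when their side matches the winning side.
# Assumption: player_slot < 128 means Radiant, >= 128 means Dire.
add_player_win <- function(joined) {
  is_radiant  <- joined$player_slot < 128
  radiant_won <- joined$radiant_win == "True"
  joined$win  <- is_radiant == radiant_won
  joined
}

# Toy merged data: one Radiant and one Dire player from each of two matches
toy <- data.frame(match_id = c(0, 0, 1, 1),
                  player_slot = c(0, 128, 4, 132),
                  radiant_win = c("True", "True", "False", "False"))
result <- add_player_win(toy)
print(result)
```

The comparison against "True" works whether ‘radiant_win’ is stored as a factor or as a character column.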

negative_votes and positive_votes

These variables are not considered to influence the win condition of a match; they only reflect each player’s view of how friendly the other players were. They will not initially be included in the data set, but will be kept in mind in case there is reason to return them to the study.

cluster

The ‘cluster’ variable refers to where in the world each game was hosted and is not relevant to the win condition of a match. This variable will not be included in the data set.

For this study the following variables from ‘match’ will be used:

match_id
start_time
duration
tower_status_radiant
tower_status_dire
first_blood_time
game_mode
radiant_win

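# Collect key variables from 'match' dataset, selected by name rather than by position
MatchDF <- MatchNewDur %>%
  select(match_id, start_time, duration, tower_status_radiant,
         tower_status_dire, first_blood_time, game_mode, radiant_win)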