write.csv(CombinedDF, file = "CombinedDF.csv")

This document builds on the previous work in ‘Dota2_Data_Wrangling.Rmd’, which produced the data frame ‘CombinedDF’. Let’s take a quick look at it as a reminder of the variables collected.

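# Load the packages used below: dplyr (pipes, glimpse), tidyr (spread), ggplot2 (plots)
library(dplyr)
library(tidyr)
library(ggplot2)

# Read in 'CombinedDF.csv' file
CombinedDF <- read.csv(file = "CombinedDF.csv")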

# Glimpse at 'CombinedDF'
glimpse(CombinedDF)
## Observations: 420,510
## Variables: 52
## $ X                         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1...
## $ match_id                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ start_time                <int> 1446750112, 1446750112, 1446750112, ...
## $ duration                  <int> 2375, 2375, 2375, 2375, 2375, 2375, ...
## $ tower_status_radiant      <int> 1982, 1982, 1982, 1982, 1982, 1982, ...
## $ tower_status_dire         <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1846, ...
## $ barracks_status_dire      <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 63, 63...
## $ barracks_status_radiant   <int> 63, 63, 63, 63, 63, 63, 63, 63, 63, ...
## $ first_blood_time          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 221, 2...
## $ game_mode                 <int> 22, 22, 22, 22, 22, 22, 22, 22, 22, ...
## $ radiant_win               <fctr> True, True, True, True, True, True,...
## $ date                      <fctr> 2015-11-05 19:01:52, 2015-11-05 19:...
## $ account_id                <int> 0, 1, 0, 2, 3, 4, 0, 5, 0, 6, 0, 7, ...
## $ hero_id                   <int> 86, 51, 83, 11, 67, 106, 102, 46, 7,...
## $ player_slot               <int> 0, 1, 2, 3, 4, 128, 129, 130, 131, 1...
## $ gold                      <int> 3261, 2954, 110, 1179, 3307, 476, 31...
## $ gold_spent                <int> 10960, 17760, 12195, 22505, 23825, 1...
## $ gold_per_min              <int> 347, 494, 350, 599, 613, 397, 303, 4...
## $ xp_per_min                <int> 362, 659, 385, 605, 762, 524, 369, 5...
## $ kills                     <int> 9, 13, 0, 8, 20, 5, 4, 4, 1, 1, 3, 9...
## $ deaths                    <int> 3, 3, 4, 4, 3, 6, 13, 8, 14, 11, 4, ...
## $ assists                   <int> 18, 18, 15, 19, 17, 8, 5, 6, 8, 6, 9...
## $ denies                    <int> 1, 9, 1, 6, 13, 5, 2, 31, 0, 0, 0, 9...
## $ last_hits                 <int> 30, 109, 58, 271, 245, 162, 107, 208...
## $ hero_damage               <int> 8690, 23747, 4217, 14832, 33740, 107...
## $ hero_healing              <int> 218, 0, 1595, 2714, 243, 0, 764, 0, ...
## $ tower_damage              <int> 143, 423, 399, 6055, 1833, 112, 0, 2...
## $ level                     <int> 16, 22, 17, 21, 24, 19, 16, 19, 12, ...
## $ xp_hero                   <int> 8840, 14331, 6692, 8583, 15814, 8502...
## $ xp_creep                  <int> 5440, 8440, 8112, 14230, 14325, 1225...
## $ xp_roshan                 <int> NA, 2683, NA, 894, NA, NA, NA, NA, N...
## $ gold_death                <int> -957, -1137, -1436, -2156, -1437, -2...
## $ gold_destroying_structure <int> 3120, 3299, 3142, 4714, 3217, 320, 3...
## $ gold_killing_heros        <int> 5145, 6676, 2418, 4104, 7467, 5281, ...
## $ gold_killing_creeps       <int> 1087, 4317, 3697, 10432, 9220, 6193,...
## $ gold_killing_roshan       <int> 400, 937, 400, 400, 400, NA, NA, NA,...
## $ unit_order_total          <int> 5041, 8385, 9167, 6396, 5588, 8197, ...
## $ team                      <fctr> R, R, R, R, R, D, D, D, D, D, R, R,...
## $ Hero_Names                <fctr> Rubick, Clockwerk, Treant Protector...
## $ All_Roles                 <fctr>  Dis   Sup  Nuk  ,  Dis Ini   Dur N...
## $ Carry_Car                 <fctr> NA, NA, NA, Car, Car, Car, Car, Car...
## $ Disabler_Dis              <fctr> Dis, Dis, Dis, NA, NA, Dis, NA, NA,...
## $ Initiator_Ini             <fctr> NA, Ini, Ini, NA, NA, Ini, NA, NA, ...
## $ Jungler_Jun               <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ Support_Sup               <fctr> Sup, NA, Sup, NA, NA, NA, Sup, NA, ...
## $ Durable_Dur               <fctr> NA, Dur, Dur, NA, Dur, Dur, Dur, NA...
## $ Nuker_Nuk                 <fctr> Nuk, Nuk, NA, Nuk, NA, Nuk, NA, NA,...
## $ Pusher_Pus                <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ Escape_Esc                <fctr> NA, NA, Esc, NA, Esc, Esc, NA, Esc,...
## $ Role_Count                <int> 3, 4, 5, 2, 3, 6, 3, 2, 4, 6, 4, 6, ...
## $ Class                     <fctr> INT, STR, STR, AGI, AGI, AGI, STR, ...
## $ WL                        <fctr> Win, Win, Win, Win, Win, Loss, Loss...

Gold Accumulation - Linear Regression Models

In this section we will address the question:

“How do I finish with a good amount of gold, as a proxy for being ‘successful’ in the game?”

Let’s first look at how gold compares to match duration.

# XY plot of duration against gold
ggplot(CombinedDF, aes(x = duration, y = gold, col = WL)) +
  geom_point(size = 0.2)

Here we see quite a broad range of gold values for different durations. Let’s look at some heroes individually to see if we can clean up the analysis. First, let’s take a look at the heroes’ popularity and the percentage of matches they tend to win.

# bar plot for most popular hero names 
Hero_Bar <- CombinedDF %>%
  group_by(Hero_Names, Class, Role_Count, All_Roles) %>%
  tally(sort = TRUE) %>%
  rename(Name=Hero_Names, HN_Count=n)

# Heroes win percentage (HWin_Pct)
HWin_Pct <- CombinedDF %>%
  group_by(Hero_Names, WL) %>%
  tally() %>%
  spread(key = WL, value = n) %>%
  rename(Name=Hero_Names) %>%
  mutate(Hero_Win_Pct = round(Win/(Win + Loss), digits = 2))

Hero_Bar <- left_join(Hero_Bar, HWin_Pct, by = "Name")
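
As a cross-check on the win-percentage step above, the same quantity can be computed as the mean of a logical ‘Win’ indicator per hero. The sketch below uses base R on a small made-up data frame (the hero labels "A" and "B" and their results are hypothetical, not taken from ‘CombinedDF’). Note that, despite its name, ‘Hero_Win_Pct’ is a fraction between 0 and 1.

```r
# Toy data with hypothetical values, mirroring the 'Hero_Names' and 'WL' columns
toy <- data.frame(
  Hero_Names = c("A", "A", "A", "B", "B", "B", "B"),
  WL         = c("Win", "Loss", "Win", "Loss", "Loss", "Win", "Loss")
)

# Fraction of rows per hero where WL == "Win", rounded like Hero_Win_Pct
win_pct <- round(tapply(toy$WL == "Win", toy$Hero_Names, mean), digits = 2)
print(win_pct)
```

Here hero "A" wins 2 of 3 matches and hero "B" wins 1 of 4, which matches ‘Win/(Win + Loss)’ for each group.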

# ggplot of heroes ordered by popularity and color by their win percentage
ggplot(Hero_Bar, aes(x = reorder(Name, -HN_Count), y = HN_Count, width = 0.8, fill = Hero_Win_Pct)) +
  scale_fill_gradient2(low = "dark blue", high = "red", mid = "white", midpoint = 0.5) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 8)) +
  xlab("Heroes ordered by popularity") +
  ylab("Heroes Used Count")

This plot shows that the two most popular heroes, Windranger and Shadow Fiend, are picked considerably more often than any other hero. Windranger loses slightly more often than she wins (blue shading indicates ‘Hero_Win_Pct’ < 0.5), while Shadow Fiend wins slightly more often than he loses (red shading indicates ‘Hero_Win_Pct’ > 0.5), although both are close to the white 0.5 midpoint. Let’s also find the hero with the highest ‘Hero_Win_Pct’, a variable calculated as ‘Win/(Win + Loss)’.

# ggplot of heroes ordered by win percentage and color by their popularity
ggplot(Hero_Bar, aes(x = reorder(Name, -Hero_Win_Pct), y = Hero_Win_Pct, width = 0.8, fill = HN_Count)) +
  scale_fill_gradient2(low = "dark blue", high = "red", mid = "white", midpoint = 5000) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 8)) +
  xlab("Heroes ordered by win percentage") +
  ylab("Heroes win percentage")

Here we see that Omniknight has the highest win percentage, 0.59. We will also examine this hero to get an indication of how the model varies between heroes.

Windranger Linear Model

Let’s collect all the data on Windranger and look at the duration vs gold XY plot.

# Selecting all Windranger data
Windranger <- CombinedDF[which(CombinedDF$Hero_Names=="Windranger"),]

# Windranger 'duration' vs 'gold' XY plot
ggplot(Windranger, aes(x = duration, y = gold, col = WL)) +
  geom_point(size = 0.2)

This plot shows Windranger’s match duration vs gold, coloured by Win/Loss. The Windranger data frame contains 17597 observations. Let’s look at the linear relationship between gold and duration for Windranger.

# first linear model where 'duration' is used to predict 'gold'
WR_LM1 <- lm(gold ~ duration, data=Windranger)
summary(WR_LM1)
## 
## Call:
## lm(formula = gold ~ duration, data = Windranger)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2678.9 -1240.3  -544.0   854.9 11049.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 725.4510    55.2897   13.12   <2e-16 ***
## duration      0.4735     0.0215   22.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1696 on 17595 degrees of freedom
## Multiple R-squared:  0.02683,    Adjusted R-squared:  0.02678 
## F-statistic: 485.2 on 1 and 17595 DF,  p-value: < 2.2e-16
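
To make the coefficients in the summary above concrete, the short sketch below plugs them into the fitted line by hand. The coefficient values are copied from the ‘WR_LM1’ output; the example duration of 2375 seconds is simply the first match shown in the glimpse output, chosen for illustration.

```r
# Coefficients copied from the WR_LM1 summary above
intercept <- 725.4510
slope     <- 0.4735   # change in predicted gold per second of match duration

# Example duration: 2375 seconds (the first match shown in the glimpse output)
duration_s <- 2375
predicted_gold <- intercept + slope * duration_s
print(predicted_gold)
```

The prediction is roughly 1850 gold, but with a residual standard error of 1696 the uncertainty around any single prediction is almost as large as the prediction itself.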

Model ‘WR_LM1’ shows that both the intercept and the duration coefficient have very low P values, so we can reject the null hypothesis that there is no linear relationship between gold and duration. Gold and duration are weakly positively related; however, the very low adjusted R-squared of 0.02678 tells us that duration alone explains under 3% of the variance in gold, so the model is poor. Let’s see if we can improve it by including additional variables, starting with ‘gold_per_min’, which is directly linked to both gold and duration.

# second linear model where 'duration' combined with 'gold_per_min' is used to predict 'gold'
WR_LM2 <- lm(gold ~ duration * gold_per_min, data=Windranger)
summary(WR_LM2)
## 
## Call:
## lm(formula = gold ~ duration * gold_per_min, data = Windranger)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4997.0  -941.6  -235.7   697.3  8032.6 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.212e+02  1.894e+02   2.752  0.00593 ** 
## duration              -1.001e+00  8.054e-02 -12.432  < 2e-16 ***
## gold_per_min           1.095e+00  4.228e-01   2.589  0.00963 ** 
## duration:gold_per_min  3.135e-03  1.793e-04  17.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1450 on 17593 degrees of freedom
## Multiple R-squared:  0.2891, Adjusted R-squared:  0.2889 
## F-statistic:  2384 on 3 and 17593 DF,  p-value: < 2.2e-16
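
The summary above lists four coefficients because the `*` operator in an R formula expands to both main effects plus their interaction. The sketch below illustrates this on synthetic data (the values are randomly generated stand-ins, not drawn from the Windranger data frame).

```r
set.seed(1)
n <- 200

# Synthetic stand-ins for 'duration' and 'gold_per_min' (hypothetical values)
toy <- data.frame(
  duration     = runif(n, 1200, 3600),
  gold_per_min = runif(n, 200, 700)
)
toy$gold <- 500 - toy$duration + 0.003 * toy$duration * toy$gold_per_min +
  rnorm(n, sd = 100)

# 'a * b' is shorthand for 'a + b + a:b'
fit_star  <- lm(gold ~ duration * gold_per_min, data = toy)
fit_terms <- lm(gold ~ duration + gold_per_min + duration:gold_per_min, data = toy)

# Both formulas describe the same model, so the coefficients agree
print(all.equal(coef(fit_star), coef(fit_terms)))
```

Because the interaction term is present, the ‘duration’ coefficient on its own describes the slope only when ‘gold_per_min’ is zero, which is why its negative sign in ‘WR_LM2’ is not a contradiction.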

Model ‘WR_LM2’ shows low P values for all terms, indicating that they are related to gold, as would be expected: the faster a hero collects gold and the longer they have to do so, the more gold they are likely to finish with. The adjusted R-squared also rises substantially to 0.2889, so combining ‘gold_per_min’ with ‘duration’ has clearly improved the model. Let’s see if we can improve the model further by adding ‘gold_spent’, since this will almost certainly have a significant impact on the gold in pocket at the end of the match.

# third linear model where 'duration' combined with 'gold_per_min' plus 'gold_spent' is used to predict 'gold'
WR_LM3 <- lm(gold ~ duration * gold_per_min + gold_spent, data=Windranger)
summary(WR_LM3)
## 
## Call:
## lm(formula = gold ~ duration * gold_per_min + gold_spent, data = Windranger)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5272.6  -693.2   -58.2   637.2  7005.8 
## 
## Coefficients:
##                         Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)            3.287e+02  1.455e+02    2.259   0.0239 *  
## duration              -1.121e+00  6.189e-02  -18.119   <2e-16 ***
## gold_per_min           3.286e+00  3.254e-01   10.099   <2e-16 ***
## gold_spent            -5.190e-01  4.696e-03 -110.515   <2e-16 ***
## duration:gold_per_min  9.945e-03  1.509e-04   65.914   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1114 on 17592 degrees of freedom
## Multiple R-squared:  0.5804, Adjusted R-squared:  0.5803 
## F-statistic:  6083 on 4 and 17592 DF,  p-value: < 2.2e-16

Model ‘WR_LM3’ shows a very large improvement in adjusted R-squared to 0.5803, so adding ‘gold_spent’ has improved the model significantly. Interestingly, the P value for the intercept has increased quite a bit, but it remains low at 0.0239.

When a hero dies in Dota 2, the amount of gold they lose is directly tied to their level. Let’s add ‘deaths’ and combine it with ‘xp_per_min’. We use ‘xp_per_min’ here because it is a continuous variable that represents the discrete variable ‘level’ well.

# fourth linear model where 'duration' combined with 'gold_per_min', 'gold_spent' and 'deaths' combined with 'xp_per_min' is used to predict 'gold'
WR_LM4 <- lm(gold ~ duration * gold_per_min + gold_spent + deaths * xp_per_min, data=Windranger)
summary(WR_LM4)
## 
## Call:
## lm(formula = gold ~ duration * gold_per_min + gold_spent + deaths * 
##     xp_per_min, data = Windranger)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5425.0  -563.1   -44.7   511.6  7318.7 
## 
## Coefficients:
##                         Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)            9.713e+02  1.340e+02    7.249 4.37e-13 ***
## duration              -8.573e-01  5.473e-02  -15.663  < 2e-16 ***
## gold_per_min           1.972e-01  3.160e-01    0.624    0.533    
## gold_spent            -5.991e-01  4.222e-03 -141.908  < 2e-16 ***
## deaths                -3.386e+00  8.959e+00   -0.378    0.706    
## xp_per_min             1.456e+00  1.718e-01    8.475  < 2e-16 ***
## duration:gold_per_min  1.172e-02  1.342e-04   87.336  < 2e-16 ***
## deaths:xp_per_min     -4.217e-01  1.835e-02  -22.985  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 967 on 17589 degrees of freedom
## Multiple R-squared:  0.6838, Adjusted R-squared:  0.6837 
## F-statistic:  5435 on 7 and 17589 DF,  p-value: < 2.2e-16
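
Because ‘deaths’ enters the model through an interaction, the effect of one additional death on predicted gold depends on ‘xp_per_min’. The sketch below works this out from the two relevant ‘WR_LM4’ coefficients copied from the summary above; the helper name `death_effect` and the three example xp rates are illustrative assumptions.

```r
# Coefficients copied from the WR_LM4 summary above
b_deaths    <- -3.386    # main effect of 'deaths'
b_deaths_xp <- -0.4217   # 'deaths:xp_per_min' interaction

# Change in predicted gold per additional death at a given xp_per_min
# (hypothetical helper, holding all other variables fixed)
death_effect <- function(xp_per_min) b_deaths + b_deaths_xp * xp_per_min

print(death_effect(c(300, 500, 700)))
```

Under this model each death costs more predicted gold the faster the hero is gaining experience, which is consistent with gold lost on death being tied to level.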

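Model ‘WR_LM4’ shows a further improvement in adjusted R-squared to 0.6837, indicating a better fit. Note that the main effects of ‘gold_per_min’ and ‘deaths’ are no longer significant on their own (P = 0.533 and 0.706), while their interaction terms remain highly significant.

We will now include ‘hero_damage’ and ‘tower_damage’ in the model, since killing heroes and destroying towers are significant sources of gold that players are encouraged to chase.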

# fifth linear model where 'duration' combined with 'gold_per_min', 'gold_spent', 'deaths' combined with 'xp_per_min', 'hero_damage' and 'tower_damage' is used to predict 'gold'
WR_LM5 <- lm(gold ~ duration * gold_per_min + gold_spent + deaths * xp_per_min + hero_damage + tower_damage, data=Windranger)
summary(WR_LM5)
## 
## Call:
## lm(formula = gold ~ duration * gold_per_min + gold_spent + deaths * 
##     xp_per_min + hero_damage + tower_damage, data = Windranger)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5297.2  -561.2   -40.5   504.6  7213.6 
## 
## Coefficients:
##                         Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)            1.404e+03  1.329e+02   10.563  < 2e-16 ***
## duration              -1.048e+00  5.493e-02  -19.077  < 2e-16 ***
## gold_per_min          -1.476e+00  3.184e-01   -4.637 3.56e-06 ***
## gold_spent            -5.903e-01  4.164e-03 -141.742  < 2e-16 ***
## deaths                 7.439e+00  8.867e+00    0.839    0.401    
## xp_per_min             2.290e+00  1.722e-01   13.301  < 2e-16 ***
## hero_damage           -3.874e-02  2.109e-03  -18.364  < 2e-16 ***
## tower_damage           7.413e-02  5.287e-03   14.022  < 2e-16 ***
## duration:gold_per_min  1.217e-02  1.397e-04   87.114  < 2e-16 ***
## deaths:xp_per_min     -4.088e-01  1.822e-02  -22.438  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 950 on 17587 degrees of freedom
## Multiple R-squared:  0.6949, Adjusted R-squared:  0.6947 
## F-statistic:  4450 on 9 and 17587 DF,  p-value: < 2.2e-16

Model ‘WR_LM5’ shows a small improvement in adjusted R-squared to 0.6947. This is the best value yet and is possibly the closest we will get to modelling the amount of gold a hero is likely to finish a match with. We also note that the P values for all terms in the model are very low apart from ‘deaths’ on its own (P = 0.401); its interaction with ‘xp_per_min’ remains highly significant.

Let’s check this model against the other heroes selected for this section.
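
One way to compare a sequence of nested models side by side is to collect their adjusted R-squared values in one vector and run an F-test between successive fits with `anova()`. The sketch below does this on synthetic data; the data frame `toy`, its generated values and the list name `models` are assumptions for illustration and do not reproduce the Windranger results.

```r
set.seed(42)
n <- 300

# Synthetic stand-in data (randomly generated, for illustration only)
toy <- data.frame(
  duration     = runif(n, 1200, 3600),
  gold_per_min = runif(n, 200, 700),
  gold_spent   = runif(n, 5000, 25000)
)
toy$gold <- 300 + 0.01 * toy$duration * toy$gold_per_min -
  0.5 * toy$gold_spent + rnorm(n, sd = 800)

# Nested models, each adding terms to the previous one
models <- list(
  LM1 = lm(gold ~ duration, data = toy),
  LM2 = lm(gold ~ duration * gold_per_min, data = toy),
  LM3 = lm(gold ~ duration * gold_per_min + gold_spent, data = toy)
)

# Adjusted R-squared for each model
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
print(round(adj_r2, 4))

# F-tests for whether each added set of terms reduces residual variance
print(anova(models$LM1, models$LM2, models$LM3))
```

Adjusted R-squared penalises extra terms, so it is the fairer figure to quote when the models differ in size.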

Linear modelling Shadow Fiend

# Selecting all Shadow Fiend data
ShadowFiend <- CombinedDF[which(CombinedDF$Hero_Names=="Shadow Fiend"),]

# linear model of Shadow Fiend where 'duration' combined with 'gold_per_min', 'gold_spent', 'deaths' combined with 'xp_per_min', 'hero_damage' and 'tower_damage' is used to predict 'gold'
# stored as 'SF_LM5' so that the Windranger model 'WR_LM5' is not overwritten
SF_LM5 <- lm(gold ~ duration * gold_per_min + gold_spent + deaths * xp_per_min + hero_damage + tower_damage, data=ShadowFiend)
summary(SF_LM5)
## 
## Call:
## lm(formula = gold ~ duration * gold_per_min + gold_spent + deaths * 
##     xp_per_min + hero_damage + tower_damage, data = ShadowFiend)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4418.8  -702.6   -99.0   605.9 28903.6 
## 
## Coefficients:
##                         Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)            2.958e+03  1.959e+02   15.100  < 2e-16 ***
## duration              -1.733e+00  8.156e-02  -21.251  < 2e-16 ***
## gold_per_min          -3.670e+00  3.679e-01   -9.975  < 2e-16 ***
## gold_spent            -5.134e-01  4.463e-03 -115.040  < 2e-16 ***
## deaths                -4.189e+01  1.329e+01   -3.153  0.00162 ** 
## xp_per_min             1.677e+00  2.198e-01    7.628 2.54e-14 ***
## hero_damage           -3.422e-02  2.308e-03  -14.828  < 2e-16 ***
## tower_damage           1.135e-01  6.803e-03   16.679  < 2e-16 ***
## duration:gold_per_min  1.186e-02  1.634e-04   72.584  < 2e-16 ***
## deaths:xp_per_min     -2.613e-01  2.336e-02  -11.184  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1172 on 14278 degrees of freedom
## Multiple R-squared:   0.65,  Adjusted R-squared:  0.6498 
## F-statistic:  2947 on 9 and 14278 DF,  p-value: < 2.2e-16

Linear modelling Omniknight

# Selecting all Omniknight data
# NOTE: this filter previously selected "Shadow Fiend", so the summary printed below
# matches the Shadow Fiend fit and needs to be regenerated with the corrected filter
Omniknight <- CombinedDF[which(CombinedDF$Hero_Names=="Omniknight"),]

# linear model of Omniknight where 'duration' combined with 'gold_per_min', 'gold_spent', 'deaths' combined with 'xp_per_min', 'hero_damage' and 'tower_damage' is used to predict 'gold'
OM_LM5 <- lm(gold ~ duration * gold_per_min + gold_spent + deaths * xp_per_min + hero_damage + tower_damage, data=Omniknight)
summary(OM_LM5)
## 
## Call:
## lm(formula = gold ~ duration * gold_per_min + gold_spent + deaths * 
##     xp_per_min + hero_damage + tower_damage, data = Omniknight)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4418.8  -702.6   -99.0   605.9 28903.6 
## 
## Coefficients:
##                         Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)            2.958e+03  1.959e+02   15.100  < 2e-16 ***
## duration              -1.733e+00  8.156e-02  -21.251  < 2e-16 ***
## gold_per_min          -3.670e+00  3.679e-01   -9.975  < 2e-16 ***
## gold_spent            -5.134e-01  4.463e-03 -115.040  < 2e-16 ***
## deaths                -4.189e+01  1.329e+01   -3.153  0.00162 ** 
## xp_per_min             1.677e+00  2.198e-01    7.628 2.54e-14 ***
## hero_damage           -3.422e-02  2.308e-03  -14.828  < 2e-16 ***
## tower_damage           1.135e-01  6.803e-03   16.679  < 2e-16 ***
## duration:gold_per_min  1.186e-02  1.634e-04   72.584  < 2e-16 ***
## deaths:xp_per_min     -2.613e-01  2.336e-02  -11.184  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1172 on 14278 degrees of freedom
## Multiple R-squared:   0.65,  Adjusted R-squared:  0.6498 
## F-statistic:  2947 on 9 and 14278 DF,  p-value: < 2.2e-16

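The modelling of Shadow Fiend indicates a broadly similar response to that of the Windranger dataset (adjusted R-squared of 0.6498 against 0.6947). Note that the Omniknight summary shown above is identical to the Shadow Fiend summary, down to the 14278 residual degrees of freedom, so it does not provide independent evidence for Omniknight.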
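
Since the same formula is fitted to several heroes, one option is to wrap the subset-and-fit step in a small function so that the hero name is written only once per fit. The sketch below is a minimal illustration: the helper `fit_gold_model` is hypothetical, and `toy` is a randomly generated stand-in for ‘CombinedDF’ that only shares its column names.

```r
# Hypothetical helper: subset one hero and fit the fifth model's formula
fit_gold_model <- function(df, hero) {
  hero_df <- df[which(df$Hero_Names == hero), ]
  stopifnot(nrow(hero_df) > 0)  # fail loudly on a misspelled hero name
  lm(gold ~ duration * gold_per_min + gold_spent + deaths * xp_per_min +
       hero_damage + tower_damage, data = hero_df)
}

# Synthetic stand-in for CombinedDF (random values, for illustration only)
set.seed(7)
n <- 400
toy <- data.frame(
  Hero_Names   = sample(c("Windranger", "Shadow Fiend", "Omniknight"), n, replace = TRUE),
  duration     = runif(n, 1200, 3600),
  gold_per_min = runif(n, 200, 700),
  gold_spent   = runif(n, 5000, 25000),
  deaths       = rpois(n, 6),
  xp_per_min   = runif(n, 200, 800),
  hero_damage  = runif(n, 2000, 30000),
  tower_damage = runif(n, 0, 6000)
)
toy$gold <- runif(n, 0, 8000)

# Fit one model per hero, keyed by hero name
heroes <- c("Windranger", "Shadow Fiend", "Omniknight")
fits <- lapply(setNames(heroes, heroes), function(h) fit_gold_model(toy, h))

# Observations used per fit; these should sum to the total number of rows
print(sapply(fits, nobs))
```

Checking the number of observations per fit is a quick way to confirm that each model was trained on its own hero's rows.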

Gold Accumulation Conclusions

All of the models above give reason to believe, based on the P values of their terms, that duration, gold_per_min, gold_spent, deaths, xp_per_min, hero_damage and tower_damage are all related to the amount of gold a hero finishes with, although for Windranger ‘deaths’ is only significant through its interaction with ‘xp_per_min’. The best model, ‘WR_LM5’, achieved an adjusted R-squared of 0.6947, meaning it explains roughly 69% of the variance in final gold, which is a fairly good fit. We can conclude that to finish the game with a high amount of gold, players should try to maximise their rates of acquiring gold and xp, avoid deaths, take down enemy players and towers (bearing in mind that the fitted ‘hero_damage’ coefficient is slightly negative, so its link to final gold is less direct than that of ‘tower_damage’), and spend gold wisely at the shop.

Best ways to play - Logistic Regression Models

For this section we will look at a selection of heroes based on how often they are picked. The two most frequently selected heroes are ‘Windranger’ and ‘Shadow Fiend’.