Ethical Web Scraping: Dota 2 Hero Information

Author

J Markley

Background & Question

I have put a few hours into Dota 2, so a dive into hero starting stats seemed like an interesting project. Scraping the data I wanted was simple enough, but writing the loop to scrape 124 separate pages, all rendered with JavaScript, was a small challenge. I intend to analyze the heroes’ starting attributes and how they relate to starting attack damage, armor, and movement speed. I am also interested in several other data points, such as average starting HP by hero type, the distribution of starting attributes by type, and the top 5 fastest heroes by weapon and/or type. Insights from this analysis can help when building team compositions and when identifying the strongest and weakest heroes at the start of the game. I will incorporate this data into my final project, which will use the Dota 2 API to gather my game history and analyze my past performances.

Libraries and Data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(httr)      
library(rvest)      

Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding
library(lubridate) 
library(magrittr)   

Attaching package: 'magrittr'

The following object is masked from 'package:purrr':

    set_names

The following object is masked from 'package:tidyr':

    extract
library(stringr)   
library(chromote)   
library(polite)    
library(readr)      
library(dplyr)      

heroes <- 
  read_csv("https://myxavier-my.sharepoint.com/:x:/g/personal/markleyj3_xavier_edu/IQCo2dMt6CSTSKj05T05vBbIAbhKkzdMoEMUOGo_rZFLffg?download=1")
Rows: 124 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): hero_name, hero_type, hero_description, hero_weapon, hero_attack
dbl (7): hero_str, hero_agi, hero_int, hero_hp, hero_mana, hero_armor, hero_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Structure

The data was mostly ready for analysis, but hero_armor and hero_movement were wrong for some heroes. On each hero page, the attack, armor, and movement values all share the same class, as does “projectile speed”, a stat that some melee heroes lack. For those heroes the values no longer lined up by position, so the wrong numbers were scraped for 9 heroes and I had to correct them manually. A tibble for reference:

as_tibble(heroes)
# A tibble: 124 × 12
   hero_name   hero_type hero_description hero_weapon hero_str hero_agi hero_int
   <chr>       <chr>     <chr>            <chr>          <dbl>    <dbl>    <dbl>
 1 Abaddon     Universal Shields his all… Melee             22       23       18
 2 Alchemist   Strength  Earns extra gol… Melee             23       22       25
 3 Ancient Ap… Intellig… Launches a powe… Ranged            20       20       23
 4 Anti-Mage   Agility   Slashes his foe… Melee             21       24       12
 5 Arc Warden  Universal Creates a copy … Ranged            20       20       24
 6 Axe         Strength  Taunts and forc… Melee             25       20       18
 7 Bane        Universal Puts his enemie… Ranged            23       23       23
 8 Batrider    Universal Can lasso an en… Ranged            23       13       22
 9 Beastmaster Universal Summons beasts … Melee             25       19       16
10 Bloodseeker Agility   Chases down low… Melee             24       24       17
# ℹ 114 more rows
# ℹ 5 more variables: hero_hp <dbl>, hero_mana <dbl>, hero_attack <chr>,
#   hero_armor <dbl>, hero_movement <dbl>
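
One way to avoid the positional mismatch described above is to pair each value with its own label and then look stats up by name. The sketch below is an illustration, not the scraper used for this project: the HTML snippet, the `.stat`, `.label`, and `.value` class names, the numbers in it, and the `read_stats()` helper are all assumptions made for the example, since the real page structure is not shown here.

```r
library(rvest)

# Hypothetical markup (NOT the real Dota 2 page): every stat sits in a
# ".stat" row, and this melee hero simply has no "Projectile Speed" row.
melee_page <- minimal_html('
  <div class="stat"><span class="label">Attack Range</span><span class="value">150</span></div>
  <div class="stat"><span class="label">Armor</span><span class="value">2.7</span></div>
  <div class="stat"><span class="label">Movement Speed</span><span class="value">325</span></div>
')

# Hypothetical helper: return a named character vector, label -> value.
read_stats <- function(page) {
  rows   <- html_elements(page, ".stat")
  labels <- html_text2(html_element(rows, ".label"))
  values <- html_text2(html_element(rows, ".value"))
  stats::setNames(values, labels)
}

hero_stats <- read_stats(melee_page)

# Look stats up by name, so a missing row cannot shift the others.
armor    <- as.numeric(hero_stats[["Armor"]])
movement <- as.numeric(hero_stats[["Movement Speed"]])

# A stat that is absent comes back as NA instead of a neighbour's value.
projectile <- as.numeric(unname(hero_stats["Projectile Speed"]))
```

With this approach a hero without a projectile speed ends up with an `NA` in that column, which is easy to spot, while armor and movement speed still land in the right place.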

Data Analysis

There are 124 heroes in Dota 2, 60 melee and 64 ranged. Here is how they break down by primary attribute:

heroes %>% 
  ggplot(aes(x = hero_weapon, fill = hero_type))+
  geom_bar()+
  labs(
    x = "Attack Type",          
    y = "Number of Heroes",    
    fill = "Hero Type") +
  theme_minimal()

As you can see, the majority of Strength heroes are melee, while the majority of Intelligence heroes are ranged. Agility and Universal heroes are split nearly evenly.
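
The same breakdown can be read off as a table of counts and within-type shares, which makes claims like "the majority are melee" easy to check. The sketch below uses only the ten heroes visible in the tibble preview earlier in this post, so its numbers describe that small sample and not the full set of 124. `sample_heroes` and `weapon_by_type` are names introduced for this example.

```r
library(dplyr)

# The first ten rows of `heroes`, copied from the tibble preview.
# Row 3's type is truncated there ("Intellig…"); "Intelligence" is the
# category name used in the text of this post.
sample_heroes <- tibble(
  hero_type = c("Universal", "Strength", "Intelligence", "Agility", "Universal",
                "Strength", "Universal", "Universal", "Universal", "Agility"),
  hero_weapon = c("Melee", "Melee", "Ranged", "Melee", "Ranged",
                  "Melee", "Ranged", "Ranged", "Melee", "Melee")
)

# Count heroes per (type, weapon) pair, then convert to a share within each type.
weapon_by_type <- sample_heroes %>%
  count(hero_type, hero_weapon) %>%
  group_by(hero_type) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

weapon_by_type
```

Replacing `sample_heroes` with `heroes` would produce the same table for the full dataset.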

I then looked at starting HP by primary attribute group as well:

heroes %>%
  ggplot(aes(x = hero_type, y = hero_hp, fill = hero_type)) +
  geom_boxplot() +
  # One panel per hero type; free_x keeps only that type's label in each panel
  facet_wrap(~ hero_type, scales = "free_x") +
  labs(
    x = "Hero Type",
    y = "HP",
    fill = "Hero Type",
    title = "Starting HP by Hero Type"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    # linewidth replaces the size argument deprecated in ggplot2 3.4.0
    panel.grid.major = element_line(color = "gray40", linewidth = 0.5),
    panel.grid.minor = element_line(color = "gray70", linewidth = 0.10),
    panel.background = element_rect(fill = "gray70"))

Strength heroes clearly have the highest starting HP, followed by Universal, then Intelligence and Agility. There are some outliers as well: the dot at the bottom of the Agility panel is Medusa, who starts with a measly 120 HP!
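
The dots in a box plot are values that sit more than 1.5 times the interquartile range beyond the box, so the same rule can be applied by hand to list those heroes by name. The sketch below is a rough illustration of that idea. The HP values and the "Hero A" to "Hero F" names are made-up placeholders, not scraped data; the only figure taken from this post is Medusa's 120 HP. `flag_hp_outliers()` is a hypothetical helper, and its quartiles may differ slightly from the ones ggplot2 draws.

```r
library(dplyr)

# Made-up HP values for illustration only; the sole real figure is
# Medusa's 120 HP. Swap in `heroes` to run this on the actual data.
toy_heroes <- tibble(
  hero_name = c("Hero A", "Hero B", "Hero C", "Hero D", "Hero E", "Hero F", "Medusa"),
  hero_type = "Agility",
  hero_hp   = c(500, 520, 540, 560, 580, 600, 120)
)

# Hypothetical helper: keep heroes whose HP lies more than 1.5 * IQR
# outside the quartiles of their own hero type.
flag_hp_outliers <- function(df) {
  df %>%
    group_by(hero_type) %>%
    mutate(
      q1  = quantile(hero_hp, 0.25, names = FALSE),
      q3  = quantile(hero_hp, 0.75, names = FALSE),
      iqr = q3 - q1
    ) %>%
    filter(hero_hp < q1 - 1.5 * iqr | hero_hp > q3 + 1.5 * iqr) %>%
    ungroup() %>%
    select(hero_name, hero_type, hero_hp)
}

hp_outliers <- flag_hp_outliers(toy_heroes)
hp_outliers
```

In this toy example only the Medusa row is flagged, because 120 falls well below the lower fence for the group.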

I then got curious about the correlation between starting Str, Agi, and Int and the stats they relate to, such as HP, mana, armor, and movement speed:

heroes %>%
  ggplot(aes(x = hero_str, y = hero_hp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    x = "Strength",
    y = "HP",
    title = "Correlation Between Strength and Starting HP"
  ) +
  theme_minimal() +
  theme(  
    panel.grid.major = element_line(color = "gray40", linewidth = 0.5),
    panel.grid.minor = element_line(color = "gray70", linewidth = 0.10),
    panel.background = element_rect(fill = "gray70"))
`geom_smooth()` using formula = 'y ~ x'

heroes %>% 
  summarize(correlation = cor(hero_str, hero_hp))
# A tibble: 1 × 1
  correlation
        <dbl>
1           1
heroes %>%
  ggplot(aes(x = hero_int, y = hero_mana)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "darkblue") +
  labs(
    x = "Intelligence",
    y = "Mana",
    title = "Correlation Between Intelligence and Starting Mana"
  ) +
  theme_minimal() +
  theme(  
    panel.grid.major = element_line(color = "gray40", linewidth = 0.5),
    panel.grid.minor = element_line(color = "gray70", linewidth = 0.10),
    panel.background = element_rect(fill = "gray70"))
`geom_smooth()` using formula = 'y ~ x'

heroes %>%
  summarize(correlation = cor(hero_int, hero_mana))
# A tibble: 1 × 1
  correlation
        <dbl>
1       0.988
heroes %>%
  ggplot(aes(x = hero_agi, y = hero_armor)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
  labs(
    x = "Agility",
    y = "Armor",
    title = "Correlation Between Agility and Starting Armor"
  ) +
  theme_minimal() +
  theme(  
    panel.grid.major = element_line(color = "gray40", linewidth = 0.5),
    panel.grid.minor = element_line(color = "gray70", linewidth = 0.10),
    panel.background = element_rect(fill = "gray70"))
`geom_smooth()` using formula = 'y ~ x'

heroes %>%
  summarize(correlation = cor(hero_agi, hero_armor))
# A tibble: 1 × 1
  correlation
        <dbl>
1       0.434
heroes %>%
  ggplot(aes(x = hero_agi, y = hero_movement)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
  labs(
    x = "Agility",
    y = "Movement Speed",
    title = "Correlation Between Agility and Starting Movement Speed"
  ) +
  theme_minimal() +
  theme(  
    panel.grid.major = element_line(color = "gray40", linewidth = 0.5),
    panel.grid.minor = element_line(color = "gray70", linewidth = 0.10),
    panel.background = element_rect(fill = "gray70"))
`geom_smooth()` using formula = 'y ~ x'

heroes %>%
  summarize(correlation = cor(hero_agi, hero_movement))
# A tibble: 1 × 1
  correlation
        <dbl>
1      0.0790

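The correlations between strength and HP (effectively 1) and between intelligence and mana (0.988) are undeniable, as you can see above. A correlation of 1 means starting HP lines up perfectly with strength in this dataset, and mana tracks intelligence almost as closely. There is also a moderate correlation between starting agility and armor (0.434), and essentially none between starting agility and movement speed (0.079).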
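
To make concrete what a correlation of 1 implies, the sketch below builds a variable that is an exact linear function of strength and shows that a linear fit recovers the line with no leftover error. The strength values are those of the ten heroes in the tibble preview earlier in this post; the intercept of 100 and slope of 10 are arbitrary numbers chosen for illustration and are not the game's formula. `toy_hp` is a name introduced here.

```r
# Starting strength for the ten heroes shown in the tibble preview.
hero_str <- c(22, 23, 20, 21, 20, 25, 23, 23, 25, 24)

# Arbitrary intercept and slope, for illustration only (NOT the game's formula).
toy_hp <- 100 + 10 * hero_str

# An exact linear relationship gives a Pearson correlation of 1.
r <- cor(hero_str, toy_hp)

# A linear model recovers the intercept and slope, leaving no residual error
# beyond floating-point rounding.
fit <- lm(toy_hp ~ hero_str)
coef(fit)
max_residual <- max(abs(residuals(fit)))
```

Running `coef(lm(hero_hp ~ hero_str, data = heroes))` on the real data would give the corresponding intercept and per-point slope for starting HP, to the extent the relationship there is as exact as the correlation suggests.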

Movement and positioning are critical when playing Dota 2, so I was curious about who the fastest heroes are right out of the gate!

heroes %>%
  group_by(hero_type) %>%
  # Keep the five fastest heroes per type; with_ties = FALSE caps each group at five
  slice_max(hero_movement, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(hero_name, hero_movement), y = hero_movement, fill = hero_type)) +
  geom_col() +
  # After coord_flip() the bars are horizontal, so hjust places each label just inside the bar end
  geom_text(aes(label = hero_movement), hjust = 1.1, size = 3) +
  facet_wrap(~hero_type, scales = "free_y") +
  labs(
    x = "Hero",
    y = "Movement Speed",
    title = "Top 5 Fastest Heroes by Type"
  ) +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none")

There are fast heroes in each group, and if your playstyle is anything like mine, you will be selecting a hero with high movement speed and armor for the beginning of the game!