Web Scraping in R using rvest

Introduction

In every data science project, data collection (or data mining) is an important step for obtaining the data we need for analysis. One of the available methods is called web scraping: the process of extracting (scraping) useful data from the text-based markup languages (HTML and its relatives) that make up a web page.

In this article, we are going to scrape itch.io using R. Itch.io itself is a very interesting website: an open marketplace for independent digital creators, with a focus on independent video games (indie games). It’s a platform that enables anyone to sell the content they’ve created.

itch.io is also a collection of some of the most unique, interesting, and independent creations you’ll find on the web. We’re not your typical digital storefront, with a wide range of both paid and free content, we encourage you to look around and see what you find. - Source: About itch.io

Below are the steps to scrape the 30 Top Rated Free Games tagged RPG Maker and Story Rich for Windows on itch.io using R. We are going to use the rvest package for web scraping.

Targeted Data:

Game Title
Developer
Rating Count
Rating Score
Story/Description
Size

# loading libs
library(rvest)
library(dplyr)

Initialize

To scrape a website, we need the HTML document (the source code) of the page we are scraping. From it, we extract text and other information based on the elements we specify. A side note: since the data is pulled from a live website, the results displayed here may differ from what you get when you run this code.

# prepare the url to scrape
url <- "https://itch.io/games/top-rated/free/platform-windows/tag-rpgmaker/tag-story-rich"
# get the html code
itchio <- read_html(url)

itchio

## {xml_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body data-page_name="browse" data-host="itch.io" class="locale_en m ...
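
To see what read_html() returns without depending on the live site, a minimal sketch can parse a small HTML string instead. The markup below is made up for illustration and is not itch.io's real page; the names toy_html, toy_page and the class game_cell are assumptions for this example only.

```r
library(rvest)

# A tiny, made-up page (not itch.io's real markup)
toy_html <- '
<html>
  <body>
    <div class="game_cell"><div class="game_title">First Game</div></div>
    <div class="game_cell"><div class="game_title">Second Game</div></div>
  </body>
</html>'

# read_html() accepts a literal HTML string as well as a URL
toy_page <- read_html(toy_html)

# the result is an xml_document, the same kind of object as `itchio`
class(toy_page)

# count how many elements carry the (assumed) class game_cell
n_cells <- length(html_nodes(toy_page, ".game_cell"))
n_cells
```

Working on a small string like this is a convenient way to try selectors before pointing them at the real page.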

Now inspect the page's HTML in your browser: right-click anywhere on the page and select Inspect (Ctrl+Shift+I on Windows).

A panel full of HTML code will open at the side or the bottom of your screen. This is the HTML we are going to work with, and it helps us map our targets. Now let’s try to extract the first element: the game title.

Game Title

Move your mouse to the upper left of the inspect panel, where the select-element button is located (in this picture, it was colored blue). This button looks like a pointer inside a square.
Hover over the game title element and notice which lines of code get highlighted in the inspect panel.

We are going to extract information from that specific div element. Use its class to mark which element we are going to scrape.

# get the game title nodes
title_html <- html_nodes(itchio, ".game_title")

title_html[[2]] # the game we hovered over earlier

## {xml_node}
## <div class="game_title">
## [1] <a class="title game_link" href="https://astralshiftpro.itch.io/pock ...

# convert the nodes to text
title_text <- html_text(title_html)

title_text

##  [1] "Wandering Wolf Trick"                    
##  [2] "Pocket Mirror"                           
##  [3] "Midnight Train"                          
##  [4] "KAIMA"                                   
##  [5] "Virgo Vs The Zodiac"                     
##  [6] "Lockheart Indigo"                        
##  [7] "Nina Aquila: Legal Eagle"                
##  [8] "The Roze Case"                           
##  [9] "1998"                                    
## [10] "Dead Dreams DEMO"                        
## [11] "Uncommon Time"                           
## [12] "Black Crystals"                          
## [13] "EMPTY HEAD"                              
## [14] "Path Out"                                
## [15] "The Clockwork Prince"                    
## [16] "The Chains that Bound Me"                
## [17] "Debbie's Dog Dilemma"                    
## [18] "Recondite: The Phantasm Emporium"        
## [19] "Super Lesbian Animal RPG - DEMO VERSION" 
## [20] "Halloween Night"                         
## [21] "Black, White and Grey"                   
## [22] "The Midnight Train to Nowhere"           
## [23] "Fantasya Final Definitiva REMAKE"        
## [24] "The Witch-in-Training and the Magic Seal"
## [25] "Crystal Confines"                        
## [26] "DAD FIGHTER 30XX (DEMO)"                 
## [27] "BoxxyQuest: The Gathering Storm"         
## [28] "Railroad Tracks"                         
## [29] "Crossroads Of Fate"                      
## [30] "Easter"

You can follow the same basic steps to extract the other elements!
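
Since the steps repeat (select nodes by class, then convert them to text), they can be wrapped in a small helper. This is only a sketch: scrape_text is a hypothetical function name, not part of rvest, and the markup is made up so the example runs offline.

```r
library(rvest)

# Hypothetical helper: select nodes by CSS selector and return trimmed text
scrape_text <- function(doc, css) {
  doc %>%
    html_nodes(css) %>%
    html_text(trim = TRUE)
}

# Made-up markup that imitates two game cells
toy_page <- read_html('
<div class="game_cell">
  <div class="game_title"> First Game </div>
  <div class="game_author">Dev A</div>
</div>
<div class="game_cell">
  <div class="game_title"> Second Game </div>
  <div class="game_author">Dev B</div>
</div>')

toy_titles  <- scrape_text(toy_page, ".game_title")
toy_authors <- scrape_text(toy_page, ".game_author")

toy_titles
toy_authors
```

On the real page, the same helper could presumably be called as scrape_text(itchio, ".game_title").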

Developer

# get author/developer
author_html <- html_nodes(itchio, ".game_author")
author_text <- html_text(author_html)

author_text

##  [1] "Nami"              "AstralShift"       "Lydia"            
##  [4] "Nami"              "Nana"              "HarmlessGames"    
##  [7] "Ethan Fox"         "Michelle Bond"     "averageavacado"   
## [10] "Aiaz Marx"         "feralphoenix"      "Yoraee"           
## [13] "cutiesbae"         "causacreations"    "Zmakesgames"      
## [16] "Yobob"             "Well Done Studios" "Suoish"           
## [19] "ponett"            "SoloDevKingdom"    "Curse"            
## [22] "mannytsu"          "CleanWaterSoft"    "feralphoenix"     
## [25] "Ichigo"            "Plum"              "SpherianGames"    
## [28] "socah"             "Blacksred Empire"  "carrotpatchgames"

Rating Count

library(stringr) # since we're going to work with character strings
rating_html <- html_nodes(itchio, ".game_rating")

rating_html

## {xml_nodeset (30)}
##  [1] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
##  [2] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
##  [3] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
##  [4] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
##  [5] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
##  [6] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
##  [7] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
##  [8] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
##  [9] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
## [10] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
## [11] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
## [12] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
## [13] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
## [14] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
## [15] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
## [16] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
## [17] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
## [18] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
## [19] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
## [20] <div class="game_rating">\n<div class="star_value">\n<div style="wi ...
## ...

string <- "[[:punct:]]" # pattern to remove: any punctuation character

rating_count <- html_text(rating_html) %>% 
  str_remove_all(pattern = string) %>% 
  str_squish() %>% as.numeric()

rating_count

##  [1] 287 216  66 163  56  34  41  11  27  16   9   9  16  15   9   8   8
## [18]   7   6   6   9  15   5   5  13   4   4   4   4   9
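
To make the cleaning step easier to follow, here is a minimal sketch on made-up strings. It assumes the rating text looks like "(287)"; the values in raw_counts are invented for illustration.

```r
library(stringr)

# Made-up strings in the shape the rating text might take, e.g. "(287)"
raw_counts <- c("(287)", " (1,024) ", "(9)")

# drop punctuation: "(", ")" and "," are all removed
no_punct <- str_remove_all(raw_counts, "[[:punct:]]")

# trim and collapse whitespace, then convert to numbers
clean_counts <- as.numeric(str_squish(no_punct))

clean_counts
```

Keep in mind that "." is punctuation too, so this pattern would turn a decimal such as "4.5" into 45. It suits whole-number counts but not scores.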

Rating Score

rating_score <- itchio %>%
  html_nodes("div") %>%
  html_nodes(".game_rating") %>%
  html_nodes("span") %>% 
  html_nodes(xpath = '//*[@class="rating_count"]') %>% 
  html_attr("title") %>% 
  as.numeric()

rating_score

##  [1] 4.83 4.75 4.88 4.50 4.68 4.76 4.71 4.91 4.59 4.69 5.00 4.89 4.50 4.40
## [15] 4.67 4.88 4.50 4.71 5.00 4.83 4.11 3.87 5.00 4.80 3.69 5.00 5.00 5.00
## [29] 5.00 3.67
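
The score is read with html_attr() rather than html_text() because it is stored in an attribute instead of in the visible text. The sketch below shows the difference on made-up markup; the structure is an assumption for illustration and may not match itch.io's actual HTML.

```r
library(rvest)

# Made-up markup: the score sits in `title`, the count in the visible text
toy_page <- read_html('
<div class="game_rating">
  <span class="rating_count" title="4.83">(287)</span>
</div>
<div class="game_rating">
  <span class="rating_count" title="4.75">(216)</span>
</div>')

toy_nodes <- html_nodes(toy_page, ".rating_count")

toy_text  <- html_text(toy_nodes)                       # visible text
toy_score <- as.numeric(html_attr(toy_nodes, "title"))  # attribute value

toy_text
toy_score
```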

Game Details

Some other details need to be scraped from the page each game links to (the link is located in the thumbnail/title). Note how the code below selects the a element with class title game_link and extracts its link (the href attribute).

# get the game links from the listing page
links <- itchio %>% 
  html_nodes("div") %>%
  html_nodes(xpath = '//*[@class="title game_link"]') %>%
  rvest::html_attr("href")

head(links)

## [1] "https://nomnomnami.itch.io/wandering-wolf-trick"
## [2] "https://astralshiftpro.itch.io/pocket-mirror"   
## [3] "https://lyd1a.itch.io/midnight-train"           
## [4] "https://nomnomnami.itch.io/kaima"               
## [5] "https://moonana.itch.io/virgovsthezodiac"       
## [6] "https://harmlessgames.itch.io/lockheart-indigo"
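
One detail worth knowing when chaining html_nodes() with XPath: an expression that starts with // searches from the document root, even when it is applied to an already selected node. To search only inside the selected node, start the expression with .// instead. The sketch below shows this on made-up markup; the ids and example.com URLs are placeholders.

```r
library(rvest)

# Made-up markup with one link inside each div
toy_page <- read_html('
<div id="first"><a class="title game_link" href="https://example.com/a">A</a></div>
<div id="second"><a class="title game_link" href="https://example.com/b">B</a></div>')

first_div <- html_nodes(toy_page, "#first")

# "//" starts from the document root, so both links come back
all_links <- first_div %>%
  html_nodes(xpath = '//*[@class="title game_link"]') %>%
  html_attr("href")

# ".//" searches only inside the selected node
own_links <- first_div %>%
  html_nodes(xpath = './/*[@class="title game_link"]') %>%
  html_attr("href")

length(all_links)
own_links
```

For the listing page this makes no difference to the result, because every link on the page is wanted anyway.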

We are going to collect the story/description automatically with a for loop that iterates over the links we scraped earlier.

# prepare a character vector to be filled by the for loop
story <- character(length(links)) # for description

# for loop
# the code below fills `story` with the full description from each game link; `i` is the link index

for (i in seq_along(links)) {

  url2 <- links[i]
  page <- xml2::read_html(url2)

  # collapse to a single string so every game gets exactly one entry,
  # even if a page has no description block or more than one
  story[i] <- page %>%
    html_nodes(xpath = '//*[@class="formatted_description user_formatted"]') %>%
    html_text() %>%
    paste(collapse = " ")

}

itchio_wrap <- data.frame(title_text, author_text, rating_score, rating_count, story, links)

itchio_wrap
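
A loop over many pages can stop halfway if a single page fails to load. One possible way to guard against that, sketched here under assumptions, is to wrap the read in tryCatch() and record NA for pages that could not be read. The helpers safe_read and get_description are hypothetical names, and the inputs are stand-ins (a literal HTML string and a file path that does not exist) so the example runs without a network connection.

```r
library(rvest)

# Hypothetical helper: return NULL instead of stopping when a page cannot be read
safe_read <- function(path_or_url) {
  tryCatch(read_html(path_or_url), error = function(e) NULL)
}

# Hypothetical helper: one string per page, NA when the page is missing
get_description <- function(page) {
  if (is.null(page)) return(NA_character_)
  page %>%
    html_nodes(xpath = '//*[@class="formatted_description user_formatted"]') %>%
    html_text(trim = TRUE) %>%
    paste(collapse = " ")
}

# Stand-ins for `links`: one literal HTML string and one path that does not exist
toy_sources <- c(
  '<div class="formatted_description user_formatted">A short story.</div>',
  "no-such-page-for-this-sketch.html"
)

toy_story <- vapply(
  toy_sources,
  function(src) get_description(safe_read(src)),
  character(1),
  USE.NAMES = FALSE
)

toy_story
```

Because vapply() is told to expect character(1), it guarantees one entry per source, which keeps the vector the same length as the other columns of the data frame. When scraping a live site, adding a short Sys.sleep() between requests is also a considerate habit.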

Visualization

Below is some additional analysis and a visualization of game rating and popularity. Let’s filter and arrange the games by their rating count and score! These games might be the ones to try next weekend or to put on your wishlist. For my personal convenience, I’d like to keep only the games with a rating score of at least 4.5 and a rating count of at least 10. From there, I’d like to try the games with the highest rating score first.

# filter and sort the data
# desc() takes a single column, so each sort key gets its own desc() call
itchio_arr <- itchio_wrap %>% 
  filter(rating_count >= 10, rating_score >= 4.5) %>% 
  arrange(desc(rating_score), desc(rating_count)) %>% 
  select(-story, -links)

itchio_arr
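
Sorting on two columns in descending order needs one desc() call per column; the second column then breaks ties in the first. Here is a minimal sketch on made-up numbers (toy_games is invented data, not scraped values).

```r
library(dplyr)

# Made-up ratings, not scraped values
toy_games <- data.frame(
  title_text   = c("A", "B", "C", "D"),
  rating_score = c(4.8, 4.8, 4.2, 4.9),
  rating_count = c(12, 40, 90, 5),
  stringsAsFactors = FALSE
)

toy_top <- toy_games %>%
  filter(rating_count >= 10, rating_score >= 4.5) %>%  # drops C (score) and D (count)
  arrange(desc(rating_score), desc(rating_count))      # A and B tie on score; B has more ratings

toy_top$title_text
```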

# visualization
library(ggplot2)

plot <- ggplot(itchio_arr, aes(x=reorder(title_text, rating_score), y=rating_score)) +
  geom_point(aes(size = rating_count, color = rating_score)) + coord_flip() +
  labs(x = "",
       y = "Score",
       title = "Highest Rated Free Adventure-RPG Games on Itch.io",
       subtitle = "Filtered for Windows Platform",
       size = "Rating Count") +
  scale_color_continuous(low = "pink", high = "maroon") + 
  scale_size_continuous(breaks = c(25,50,100,200)) + 
  guides(color = "none") + # hide the color legend
  theme_minimal()

plot

This is only one example of how we can scrape a website using rvest and analyze the result further with data visualization. There is so much more to discover! With further development and expansion, this data could be used to build a personalized game recommendation app (based on genre, platform, etc.). It could also be used for an indie game analysis project.

Hopefully this article helps you better understand the mechanics of web scraping and enriches your basic knowledge of how to perform it. Happy learning!