Intro

We will scrape the web using the rvest package in R. The data we will scrape is Metacritic's list of the best PC games of all time, which you can view at https://www.metacritic.com/browse/games/score/metascore/all/pc/filtered?sort=desc.
We will collect several pieces of information about each game:
1. Title : The title of the game
2. Rank : The rank of the game based on the metascore
3. Metascore : The metascore of the game, given by critics
4. User Score : The score of the game, based on user ratings
5. Genre : The genre of the game
6. Rating : The age rating of the game
7. Release Date : The date the game was released

NOTE: Since the data is pulled from a live website, the results shown here may differ from what you get when you run this code.

Web Scraping

To do web scraping, first we need to specify the URL.

URL <- "https://www.metacritic.com/browse/games/score/metascore/all/pc/filtered?sort=desc"

Then, we will read the HTML from the website using the rvest package.

library(rvest)
## Loading required package: xml2
webpage <- read_html(URL)

First, we try to pull the title of each game. Since we only target specific objects in the HTML, we will use the Selector Gadget, which you can download at https://selectorgadget.com/. If you use Chrome as your browser, you can add the tool as an extension.

Using the Selector Gadget, click one of the game titles.

As you can see, even though we chose a game title, some other elements on the page are selected as well. Our next step is to click those irrelevant elements to deselect them, so that only the game titles remain. The first page lists the top 100 games, so we know we have the right target when the number shown after Clear is (100).

After we get a clean selection of game titles, we copy the text inside the box at the bottom.

Next, we read the titles from the HTML and convert them into text.

#get the title
title_html <- html_nodes(webpage,'#main .product_title a')

#convert title to text
title_text <- html_text(title_html)
head(title_text)
## [1] "\n                            Half-Life 2\n                                                    "                     
## [2] "\n                            Grand Theft Auto V\n                                                    "              
## [3] "\n                            The Orange Box\n                                                    "                  
## [4] "\n                            Half-Life\n                                                    "                       
## [5] "\n                            BioShock\n                                                    "                        
## [6] "\n                            Baldur's Gate II: Shadows of Amn\n                                                    "
#Clean the game title
##trim the surrounding newlines and padding spaces
title_text <- trimws(title_text)
##replace spaces with hyphens to match the URL format used later
title_text <- gsub(" ","-",title_text)
head(title_text)
## [1] "Half-Life-2"                      "Grand-Theft-Auto-V"              
## [3] "The-Orange-Box"                   "Half-Life"                       
## [5] "BioShock"                         "Baldur's-Gate-II:-Shadows-of-Amn"

As you can see, we have successfully acquired the titles of the games. Next, we will repeat the same steps for the rest of the information we want.
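
Before repeating the step, here is a small offline sketch of the same select-then-clean pattern. The HTML snippet is made up and only imitates the structure that the selector targets; it is not the real Metacritic markup. It shows that the `#main` prefix limits the match to one part of the page, and that `html_text()` can trim the padding itself through its `trim` argument.

```r
library(rvest)

# Hypothetical markup that only imitates the structure targeted by the selector
snippet <- '<div id="main">
  <h3 class="product_title"><a href="#">
      Game One
  </a></h3>
  <h3 class="product_title"><a href="#">
      Game Two: Subtitle
  </a></h3>
</div>
<div id="side"><h3 class="product_title"><a href="#">Not in main</a></h3></div>'

# read_html() also accepts a literal HTML string, so no network access is needed
page <- read_html(snippet)

# Only links inside #main match; the one inside #side is ignored
nodes <- html_nodes(page, "#main .product_title a")

# trim = TRUE strips the leading and trailing whitespace of each node
titles <- html_text(nodes, trim = TRUE)
titles
```

Trimming at extraction time avoids depending on the exact number of padding characters, which may change whenever the site's layout changes.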

#get the meta score
metascore_html <- html_nodes(webpage,'#main .positive')
metascore_text <- html_text(metascore_html)

## convert to numeric value
metascore <- as.numeric(metascore_text)
head(metascore)
## [1] 96 96 96 96 96 95
#get the user score
userscore_html <- html_nodes(webpage,'#main .textscore')
userscore_text <- html_text(userscore_html)
userscore <- as.numeric(userscore_text)*10
head(userscore)
## [1] 91 77 92 90 85 92
#get the release date
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
release_html <- html_nodes(webpage,".full_release_date .data")
release_text <- html_text(release_html)
release_date <- mdy(release_text)
head(release_date)
## [1] "2004-11-16" "2015-04-13" "2007-10-10" "1998-11-19" "2007-08-21"
## [6] "2000-09-24"
release_year <- year(release_date)
head(release_year)
## [1] 2004 2015 2007 1998 2007 2000
release_month <- month(release_date)
head(release_month)
## [1] 11  4 10 11  8  9
release_day <- weekdays(release_date)
head(release_day)
## [1] "Tuesday"   "Monday"    "Wednesday" "Thursday"  "Tuesday"   "Sunday"
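
The score conversion above assumes that every scraped score is numeric text. The following is a minimal sketch of one way to make non-numeric entries explicit. The sample strings are made up, and the "tbd" placeholder for a game without enough user ratings is an assumption, not something taken from the scraped output.

```r
# Made-up sample of scraped score strings; "tbd" is an assumed placeholder
raw_scores <- c("9.1", "7.7", "tbd")

# Mark anything that is not a plain decimal number as NA up front,
# so as.numeric() does not have to warn about it
is_number <- grepl("^[0-9]+(\\.[0-9]+)?$", raw_scores)
scores <- rep(NA_real_, length(raw_scores))
scores[is_number] <- as.numeric(raw_scores[is_number])

# Rescale from the 0-10 scale to 0-100 and round away floating-point noise
userscore_demo <- round(scores * 10)
userscore_demo
```

Keeping the missing values as NA preserves the length of the vector, which matters later when all columns are combined into one data frame.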

The rating and genre of a game can only be found on the game's own page, so we need some extra steps to get them. Luckily, Metacritic uses predictable URLs: each game's page is located at metacritic.com/game/pc/game-title, so all we need to do is append the title we have already acquired to the base URL and scrape each page. This may take some time.
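
The URL construction described above can be sketched in base R as follows. The helper name `make_slug` is hypothetical, and the cleaning rules (lowercase, drop apostrophes and colons, hyphens for spaces) are only the ones this tutorial relies on; titles with other punctuation may need extra rules.

```r
# Hypothetical helper: turn a display title into the slug format described above
make_slug <- function(title) {
  slug <- tolower(trimws(title))
  slug <- gsub("[':]", "", slug)    # drop every apostrophe and colon
  slug <- gsub("\\s+", "-", slug)   # runs of whitespace become one hyphen
  slug
}

base_url <- "https://www.metacritic.com/game/pc/"
titles <- c("Half-Life 2", "Baldur's Gate II: Shadows of Amn")

# paste0() is vectorised, so this builds one URL per title
game_urls <- paste0(base_url, make_slug(titles))
game_urls
```

Wrapping the rules in one function keeps the slug logic in a single place, which makes it easier to adjust when a generated URL turns out not to exist.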

library(tidyverse)
## -- Attaching packages --------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x readr::guess_encoding()  masks rvest::guess_encoding()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x purrr::pluck()           masks rvest::pluck()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()
genre <- data.frame()
rating <- matrix(nrow = 100)
title_add <- tolower(title_text)
#remove every apostrophe and colon (str_replace would only remove the first match)
title_add <- str_replace_all(title_add,"'","")
title_add <- str_replace_all(title_add,":","")

for (i in 1:100) {
  URL_add <- "https://www.metacritic.com/game/pc/"

  #Read the page of game i
  webpage_add <- read_html(paste(URL_add,title_add[i],sep = ""))

  #Get the rating (NA when the page has none)
  rating_html <- html_nodes(webpage_add,".product_rating .data")
  rating_text <- html_text(rating_html)
  if (length(rating_text) != 0) {
    rating[i] <- rating_text
  } else {
    rating[i] <- NA
  }

  #Get up to three unique genres
  genre_html <- html_nodes(webpage_add,".product_genre .data")
  genre_text <- html_text(genre_html)
  genre_text <- unique(genre_text)
  genre_i <- data.frame(genre1 = genre_text[1],
                        genre2 = genre_text[2],
                        genre3 = genre_text[3])
  genre <- rbind(genre,genre_i)
}

head(rating)
##      [,1]
## [1,] "M" 
## [2,] "M" 
## [3,] "M" 
## [4,] "M" 
## [5,] "M" 
## [6,] "T"
head(genre)
##         genre1           genre2        genre3
## 1       Action          Shooter  First-Person
## 2       Modern Action Adventure    Open-World
## 3       Action    Miscellaneous       Shooter
## 4       Action          Shooter  First-Person
## 5       Action          Shooter  First-Person
## 6 Role-Playing     PC-style RPG Western-Style
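
The loop above stops at the first page that fails to load, for example when a generated URL does not exist. As a hedged sketch, a small wrapper can turn such errors into NULL and pause between requests. The name `safe_fetch` and its arguments are hypothetical; the `fetch` argument defaults to `rvest::read_html` but can be swapped for any function, which is what makes the example below runnable offline.

```r
# Hypothetical wrapper: pause, try to fetch, and return NULL instead of failing
safe_fetch <- function(url, fetch = rvest::read_html, pause = 1) {
  Sys.sleep(pause)                       # wait before each request
  tryCatch(fetch(url),
           error = function(e) NULL)     # a failed page becomes NULL
}

# Offline demonstration with stand-in fetch functions
always_fails <- function(url) stop("HTTP error 404")
echo_upper   <- function(url) toupper(url)

failed <- safe_fetch("some-game", fetch = always_fails, pause = 0)
worked <- safe_fetch("some-game", fetch = echo_upper, pause = 0)
is.null(failed)
worked
```

Inside the loop, a NULL result could be checked with `is.null()` so that the rating and genres of that game are recorded as NA and the loop moves on.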

Finally, we have all the information we need. Next, we will combine the data and visualize it to see what we can learn from it.

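#Turn the hyphens back into spaces for display
#(note: hyphens that belong to a title, e.g. Half-Life, become spaces too)
title_data <- gsub("-"," ",title_text)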

#Combine the data into data frame
df <- data.frame(rank = c(1:100),
                 title = title_data,
                 rating = rating,
                 metascore = metascore,
                 userscore = userscore,
                 release_date = release_date,
                 year = release_year,
                 month = release_month,
                 day = release_day
                 )
df <- cbind(df,genre)

head(df)
##   rank                            title rating metascore userscore
## 1    1                      Half Life 2      M        96        91
## 2    2               Grand Theft Auto V      M        96        77
## 3    3                   The Orange Box      M        96        92
## 4    4                        Half Life      M        96        90
## 5    5                         BioShock      M        96        85
## 6    6 Baldur's Gate II: Shadows of Amn      T        95        92
##   release_date year month       day       genre1           genre2
## 1   2004-11-16 2004    11   Tuesday       Action          Shooter
## 2   2015-04-13 2015     4    Monday       Modern Action Adventure
## 3   2007-10-10 2007    10 Wednesday       Action    Miscellaneous
## 4   1998-11-19 1998    11  Thursday       Action          Shooter
## 5   2007-08-21 2007     8   Tuesday       Action          Shooter
## 6   2000-09-24 2000     9    Sunday Role-Playing     PC-style RPG
##          genre3
## 1  First-Person
## 2    Open-World
## 3       Shooter
## 4  First-Person
## 5  First-Person
## 6 Western-Style

First, we want to know if there is a big discrepancy between the metascore and the user score.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
p <- df %>% 
  mutate(differ = metascore - userscore,
         meta = if_else(differ>0,"higher","lower")) %>% 
  ggplot(aes(group=1,
             text = paste("Game Title : ", title,
                          "<br> Metascore : ", metascore,
                          "<br> User Score : ", userscore)))+
  geom_col(aes(rank,differ,fill=meta))+
  theme(panel.background = element_blank())+
  labs(y= "Metascore - User Score", x = "Game Rank", fill = "Metascore",
       title = "Is the game metascore higher than the user score?")+
  scale_x_continuous(breaks = seq(0,100,5),expand = c(0,0))+
  scale_y_continuous(expand = c(0,0),breaks = seq(-5,70,10))

ggplotly(p,tooltip = "text")

As we can see, most metascores are higher than the corresponding user scores. One unique case is The Witcher 3: Wild Hunt Blood and Wine, which has the same score on both ends. In the bottom 25, some user scores are higher than the metascore, suggesting that players think these games deserve a better score than the critics gave, even if only by a small margin. The biggest difference belongs to Out of the Park Baseball 17: users found the game disappointing, giving it a user score of only 33.
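
Note that the plot code labels a game as "higher" only when the difference is above zero, so a tie falls into the "lower" group. The following sketch shows one possible way to give ties their own label. The three games and their scores are made up for illustration.

```r
# Made-up scores covering the three possible cases
scores <- data.frame(
  title     = c("Game A", "Game B", "Game C"),
  metascore = c(96, 92, 90),
  userscore = c(91, 92, 93)
)

# Positive: critics scored higher; negative: users scored higher
scores$differ <- scores$metascore - scores$userscore

# Nested ifelse() so that a difference of zero gets its own label
scores$meta <- ifelse(scores$differ > 0, "higher",
                      ifelse(scores$differ < 0, "lower", "equal"))
scores
```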

Next, we want to know which combination of genre and rating has the highest score. Assuming that the scores given by critics are more objective and reflect the true value of a game, we will use the metascore. Since a game can have multiple genres, we will only use the first one. We will compute the mean metascore of the games in each combination.
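
To make the grouping step concrete, here is a base R sketch on made-up rows. It also illustrates a subtlety of `na.omit()`: applied to a whole data frame, it drops every row that has a missing value in any column, including columns such as genre3 that the summary does not use. Restricting the data to the needed columns first is one way to keep those rows.

```r
# Made-up rows; genre3 is missing for one game on purpose
toy <- data.frame(
  genre1    = c("Action", "Action", "Action", "Role-Playing", "Role-Playing"),
  rating    = c("M", "M", "T", "T", "T"),
  metascore = c(96, 94, 90, 95, 93),
  genre3    = c("First-Person", NA, "Shooter", "Western-Style", "Western-Style"),
  stringsAsFactors = FALSE
)

# na.omit() on the full data frame drops the row with the missing genre3
rows_full <- nrow(na.omit(toy))

# Restricting to the columns used in the summary keeps that row
used <- na.omit(toy[, c("genre1", "rating", "metascore")])

# Mean metascore and number of games per genre/rating combination
combi <- aggregate(metascore ~ genre1 + rating, data = used, FUN = mean)
combi$total <- aggregate(metascore ~ genre1 + rating, data = used,
                         FUN = length)$metascore
combi
```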

df_combi <- df %>% na.omit() %>% 
  group_by(genre1,rating) %>% 
  summarise(metascore = mean(metascore), total = n())
p <- df_combi %>% 
  ggplot()+
  geom_tile(aes(rating,genre1,fill=metascore,size=total),color="#1D2024")+
  scale_fill_viridis_c(option = "B")+
  labs(title = "Combination of Genre and Rating",
       x = "Rating", y= "Genre")+
  theme(panel.background = element_blank())

ggplotly(p)

The highest mean score is achieved by the combination of the Modern genre and the M rating. Meanwhile, the most common combination is Action with the M rating, with 15 games.

Final Note

To do web scraping in R, we can use the rvest package. To target a specific item in the HTML, we used the Selector Gadget extension. We may need to do some cleaning before we use the data, such as turning the game titles into something readable and converting text to numeric values. Now that you have followed all of the steps above, we hope you can scrape a website of your own choice.