This worksheet provides a practical tutorial on how webscraping can be used to collect data for further analysis in R, using the ‘rvest’ package.
First, make sure you have the ‘rvest’ and ‘tidyverse’ packages installed:
install.packages("rvest")
install.packages("tidyverse")
Alongside ‘rvest’ and ‘tidyverse’ (both developed by Wickham, 2019), this workflow also makes use of two additional packages that are installed as part of the tidyverse: ‘magrittr’ and ‘purrr’.
Rvest enables easy collection of data from HTML webpages using a series of simple commands.
Tidyverse encompasses a number of powerful packages that work well together to optimise data analysis pipelines. In particular, ‘rvest’ makes good use of the ‘magrittr’ package, which allows sequential sections of code to be run fluidly using ‘the pipe’, denoted by %>%. ‘purrr’ is used later in this workflow when applying the ‘rvest’ functions across a vector to collect multiple elements from a web page (you will notice it is called explicitly using ‘purrr::’).
Then load the packages by calling the library() function. Additional details about a package can be displayed by using ?rvest.
library(rvest)
library(tidyverse)
## Loading required package: xml2
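As a quick illustration of the pipe mentioned above (a toy example, not part of the scraper), the following two lines return the same result once the packages are loaded:
sqrt(c(1, 4, 9)) #Standard function call
c(1, 4, 9) %>% sqrt() #The same call written with the pipe
## [1] 1 2 3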
We will now build a webscraper that collects reviews left on Trustpilot. In this example, we will collect reviews of Oak Furnitureland.
https://uk.trustpilot.com/review/www.oakfurnitureland.co.uk
The webscraper comprises 3 elements: 1. the page to be read, 2. the location of the element to be collected and 3. the type of information to be returned (i.e. text, a url link etc.).
Create an object that is the url of the webpage to be scraped.
url <- "https://uk.trustpilot.com/review/www.oakfurnitureland.co.uk"
page <- read_html(url) #1st element
page %>%
html_node(".review-content__text") %>% #Specifies location of text to be scraped
html_text(trim = TRUE) #Returns data as text
## [1] "I love my new sideboard, delivered on time, beautifully..."
The trim = TRUE parameter in the html_text() function removes leading and trailing whitespace (including line breaks) from the responses. Try trim = FALSE to see the difference.
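For example, re-running the same pipeline with trim = FALSE keeps any whitespace surrounding the review text:
page %>%
  html_node(".review-content__text") %>%
  html_text(trim = FALSE) #Whitespace around the text is kept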
".review-content__text" specifies the location of the information to be collected by the webscraper. You can easily generate these labels using a tool called ‘SelectorGadget’ developed by Hadley Wickham.
https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html
To use SelectorGadget, visit the url above and drag the hyperlink onto the bookmarks toolbar of your internet browser. Then, on the page you want to scrape, click the SelectorGadget bookmark and click the element that you wish to collect; it will turn green, and all other elements that will also be collected are highlighted in yellow. If any highlighted elements are ones you do not wish to collect, simply click them; they will turn red and be removed from the selection. Finally, once all the elements you want to collect are highlighted, copy the selector shown in the bottom right-hand text box into the html_node() function. *Do not forget to include quotation marks around the selector.
Using SelectorGadget to select webpage elements for collection, source: Trustpilot
The html_node() function will only return the first element matched by the selector. If you want to collect all of the elements that match, simply use html_nodes() instead.
page %>%
html_nodes(".review-content__text") %>%
html_text(trim = TRUE)
## [1] "I love my new sideboard, delivered on time, beautifully..."
## [2] "My review only concerns the ordering stage. On line..."
## [3] "After booking a virtual visit we could not get..."
## [4] "Excellent service from start to finish, delivered on time..."
## [5] "It was delivered during 3 hour slot but driver..."
## [6] "After two months our oak table arrived from Vietnam..."
## [7] "We love our Oak Furnitureland chest of drawers and..."
## [8] "Delivered on time as specified, both delivery men extremely..."
## [9] "NA..."
## [10] "We received a damaged TV table yesterday from Oak..."
## [11] "The guy on the phone was fantastic. Very helpful...."
## [12] "Super service and very efficient.Thank you can’t wait for..."
## [13] "Really very pleased with mirror ordered, exactly as described..."
Outputs can be saved as an object by placing an object name and the assignment operator ‘<-’ in front of the code (in this case ‘Review’).
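For example, the reviews collected above can be stored in an object called ‘Review’:
Review <- page %>%
  html_nodes(".review-content__text") %>%
  html_text(trim = TRUE) #All reviews on the page, stored as a character vector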
Data collected can be converted to a data frame and saved as a csv file using the following functions:
Review <- as.data.frame(Review)
write.csv(Review, file = "filename.csv")
You have now managed to collect all of the reviews from a single web page. However, as is often the case, the data you are interested in may be spread across many webpages. In this example, there are many pages of reviews, with approximately 18 reviews per page. Repeating the above code for every page would be tiresome; instead, we can automate the process using a for loop.
Notice that when you click on ‘page 2’ of the reviews, the url of the webpage changes to ‘https://uk.trustpilot.com/review/www.oakfurnitureland.co.uk?page=2’. The ‘?page=2’ query parameter changes with each subsequent page of reviews (?page=3, ?page=4 etc.). We can use a function to generate a list of urls for all of the pages we want to scrape data from.
Note that scraping many web pages can not only be computationally intensive, but can also send a large number of requests to a server. When making these requests it is important to behave ethically and avoid overburdening the server. As a result, we will only collect reviews from the first 5 pages.
urls <- sprintf("https://uk.trustpilot.com/review/www.oakfurnitureland.co.uk?page=%d", 1:5)
urls
## [1] "https://uk.trustpilot.com/review/www.oakfurnitureland.co.uk?page=1"
## [2] "https://uk.trustpilot.com/review/www.oakfurnitureland.co.uk?page=2"
## [3] "https://uk.trustpilot.com/review/www.oakfurnitureland.co.uk?page=3"
## [4] "https://uk.trustpilot.com/review/www.oakfurnitureland.co.uk?page=4"
## [5] "https://uk.trustpilot.com/review/www.oakfurnitureland.co.uk?page=5"
Notice that when using sprintf(), the %d placeholder is replaced by each of the values specified by the second argument of the function (in this case the numbers 1 to 5).
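A toy example makes the substitution clearer:
sprintf("?page=%d", 1:3) #%d is replaced by 1, 2 and 3 in turn
## [1] "?page=1" "?page=2" "?page=3"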
We can use a for loop to iterate through the urls and collect the reviews from each page. Before calling the for loop, we need to create an empty dataframe to which we will add the data once collected.
AllRevs <- data.frame() #Creates an empty dataframe to store data
for (i in urls){
  page <- read_html(i)
  Review <- page %>%
    html_nodes(".review-content__text") %>%
    html_text(trim = TRUE)
  temp <- as.data.frame(Review)
  AllRevs <- rbind(AllRevs, temp)
}
AllRevs
Notice the addition of the last two lines of code:
temp <- as.data.frame(Review)
AllRevs <- rbind(AllRevs, temp)
The first line creates a temporary dataframe of the reviews on that webpage (as before, for the single-page example); the second line uses the rbind() function to bind the rows of data from each page together into a single dataframe. Because AllRevs is overwritten with the combined result on every pass, the data collected on each iteration of the for loop is appended to the AllRevs object.
nrow(AllRevs) #94 rows
Counting the rows of the AllRevs object shows how many reviews the webscraper collected across the 5 pages.
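As noted earlier, it is good practice not to overload the server with requests. A simple courtesy is to pause briefly between pages; the sketch below repeats the same loop with a one-second delay added using Sys.sleep() (the length of the delay is an arbitrary choice):
AllRevs <- data.frame()
for (i in urls){
  page <- read_html(i)
  Review <- page %>%
    html_nodes(".review-content__text") %>%
    html_text(trim = TRUE)
  AllRevs <- rbind(AllRevs, as.data.frame(Review))
  Sys.sleep(1) #Wait one second before requesting the next page
}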
Additional elements can be collected by using SelectorGadget to find each selector and specifying a name for each new element. We will again read the pages using a for loop and pass each page to map_df() to run the webscraper functions on every review; map_df() comes from ‘purrr’, which is installed as part of the tidyverse.
Use the SelectorGadget to select the entire review (shown below), then use map_df() to iterate through each element of interest for each review. The code below also uses an if-else statement to check whether a piece of data is missing; if it is (i.e. length(.) == 0), it returns NA.
Select the element that highlights the entire review, source: Trustpilot
The code below iterates through each of the pages and collects the review ‘title’ and main ‘review’ text from each review.
Reviews <- data.frame() #Empty data frame
for (i in urls){
  page <- read_html(i)
  Scrape <- page %>% html_nodes(".review") %>%
    purrr::map_df(~list(title = html_nodes(.x, ".link--dark") %>% #Collect review title
                          html_text() %>%
                          {if(length(.) == 0) NA else .}, #Returns NA for missing data
                        review = html_nodes(.x, ".review-content__text") %>% #Collect review
                          html_text(trim = TRUE) %>%
                          {if(length(.) == 0) NA else .}))
  temp <- data.frame(Scrape)
  Reviews <- rbind(temp, Reviews)
}
In some cases, it may be useful to know if an individual has posted more than once. You could identify this by collecting and anonymising usernames.
page_urls <- sprintf("https://uk.trustpilot.com/review/www.spotify.com?page=%d", 1:5)
Reviews <- data.frame()
for (i in page_urls){
  page <- read_html(i)
  Scrape <- page %>% html_nodes(".review") %>%
    purrr::map_df(~list(user = html_nodes(.x, ".consumer-information__name") %>% #Usernames
                          html_text(trim = TRUE) %>%
                          {if(length(.) == 0) NA else .},
                        title = html_nodes(.x, ".link--dark") %>% #Review title
                          html_text(trim = TRUE) %>%
                          {if(length(.) == 0) NA else .},
                        review = html_nodes(.x, ".review-content__text") %>% #Review text
                          html_text(trim = TRUE) %>%
                          {if(length(.) == 0) NA else .}))
  temp <- data.frame(Scrape)
  Reviews <- rbind(temp, Reviews)
}
Reviews$user <- match(paste0(Reviews$user), unique(paste0(Reviews$user))) #Generates a unique id for each username
Reviews$user <- paste0("User", Reviews$user) #Prefixes the id number with 'User'
The above code generates a dataframe object called ‘Reviews’. The dimensions of the dataframe are as follows:
dim(Reviews)
## [1] 100 3
The dataframe includes the user, the review title and the actual review text. Notice that the actual usernames have been anonymised and replaced by ‘User’ followed by a number. This is achieved using two lines of code:
Reviews$user <- match(paste0(Reviews$user), unique(paste0(Reviews$user)))
Reviews$user <- paste0("User", Reviews$user)
The first line generates a number for each unique user in order of appearance; if the same user appears more than once, the same number is assigned. The second line prefixes this number with the word ‘User’.
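A toy example with made-up usernames shows how the two lines behave:
users <- c("Anna B", "Chris D", "Anna B")
match(users, unique(users)) #Repeat posters share the same number
## [1] 1 2 1
paste0("User", match(users, unique(users)))
## [1] "User1" "User2" "User1"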
Webscraping is an extremely powerful tool for collecting large quantities of data for further analysis. Depending on the data collected, you can address any number of research topics. Make sure you obtain ethical approval, if required, for your research project before collecting and storing any scraped data.
Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wickham, H. (2019). rvest: Easily Harvest (Scrape) Web Pages. R package version 0.3.5. https://CRAN.R-project.org/package=rvest