Introduction to web-scraping using rvest and Posit Cloud

https://rpubs.com/hannahkreczak/webscrapingrvest

Hannah Kreczak

What is web scraping and why is it useful?


  • Technique for automatically extracting data and information from websites.

  • It allows you to transform unstructured or piecemeal information from websites into structured datasets (for example, the News and communications pages on GOV.UK (www.gov.uk)).

  • Without scraping, data would typically have to be accessed by manual browsing of sites and copying/pasting.

  • Therefore, web scraping can:
    • unlock new data and information – providing deeper insight,
    • save a lot of time and effort,
    • and reduce the risk of error.

How to tell if web-scraping is suitable for a particular project


Likely to be suitable if…

  • You want to access lots of information from a small number of sites.
  • The info you want is presented in a consistent way throughout the site.
  • The site is open to the public and doesn't mention scraping in its T&Cs.
  • The data cannot be accessed more easily through other means.

Unlikely to be suitable if…

  • You want to access a little bit of info from lots of different sites.
  • There is no consistency in how the information you want is presented.
  • The site is subscription-based and/or 'forbids' scraping.
  • The data can easily be provided using other means – e.g. just asking for it!

A five-step guide to web-scraping


  1. Assess if web-scraping is suitable for your chosen website.

  2. Develop a logical approach to navigating the site (i.e. the series of steps you would take to manually get the info you want).

  3. Inspect the webpage and identify the HTML elements that contain the information you want.

  4. Write and execute the web-scraping code.

  5. Export the extracted information/carry out subsequent analysis.

A choice of different tools


Scrapers can be written in a wide range of programming languages. Examples include:


Visual Basic for Applications (VBA)

Using references to MS Internet Controls and HTML Object Library.


Python

Using the Beautiful Soup and Mechanize packages.


R

Using the rvest and httr packages. This is what we'll use today, through Posit Cloud! (https://posit.cloud)

Worked example – is web-scraping suitable for the chosen website?

Worked Example: How has the frequency of COVID-19 reporting changed over the last 4 years on gov.uk news?


Navigate to the URL we're interested in: https://www.gov.uk/search/news-and-communications


…does this website satisfy the web-scraping project criteria?

Worked example - developing a logical approach to navigating


  1. Navigate to https://www.gov.uk/search/news-and-communications.

  2. Manually take the steps needed to filter the news results to show only articles relating to the topic of COVID-19.

  3. Question to ask: does the page I'm viewing have all the information I need, or will I need to take more steps to get all the data I want?

  4. Question to ask: what happens to the URL(s), and how will that inform my web-scraping algorithm?

Answer - developing a logical approach to navigating

  • Selecting the drop-down “Topic” and filtering the topic to “COVID-19” will reduce the news results to only those relating to COVID-19.

  • Doing this changes the URL, so we'll need to use the new URL in our web-scraping.

  • All articles relating to COVID-19 are spread over multiple results pages, so to get all of them we'll need to loop over the pages, changing the URL ending "&page=…".


NOTE: some web pages are dynamically generated using JavaScript, and the URL does not update on user interaction. Scraping these pages requires extra tools, such as Selenium (in R, the RSelenium package).

HTML basics


  • HTML code is made up of different types of elements.

  • These elements can be nested: the title, paragraph and hyperlink are nested in the body of the HTML document.

  • Elements can have attributes; for example, the hyperlink has an href attribute, which is the URL it points to.

HTML Elements and Attributes


There are many types of elements that have different functions and uses in HTML. Examples used later in this presentation include the heading <h1>, list item <li>, division <div>, hyperlink <a> and <time> elements.




Attributes are properties of an element, and some are specific to the element they're attached to. Another important attribute is the class attribute: it is often used to point to a class name in a style sheet for the webpage.
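As an illustration of these ideas, here is a minimal sketch using rvest on a small made-up HTML fragment (not the gov.uk page), showing how elements, attributes and classes can be picked out:

library(rvest)

#A tiny, made-up HTML fragment to illustrate nesting, attributes and classes
example_page <- minimal_html('
  <body>
    <h1 class="page-title">Example title</h1>
    <p>A paragraph containing a <a href="https://www.gov.uk">hyperlink</a>.</p>
  </body>')

html_elements(example_page, "h1") |> html_text()        #text inside the <h1> element
html_elements(example_page, "a") |> html_attr("href")   #the href attribute of the <a> element
html_elements(example_page, "h1") |> html_attr("class") #the class attribute of the <h1> element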

Worked example: inspect webpage and identify HTML elements

  1. Navigate to https://www.gov.uk/search/news-and-communications.

  2. Right-click and select "View page source" (or "Inspect").

  3. Find the HTML elements related to the content we want.

  4. Make a note of the element names and unique attributes.

Answer: inspect webpage and identify HTML elements

The key elements for this example are the <time> elements (which hold the article dates) and the hyperlink <a> elements nested inside the <div> element with class "finder-results js-finder-results" (which hold the article URLs). ["…" indicates removed code, for clarity]

The {rvest} package


#install.packages("rvest")

library(rvest)

url<-"https://www.gov.uk/search/news-and-communications"

#Simulating a session in an html browser
simulated_session <- session(url)

#Extracting the elements from a website
extracted_element<-html_elements(simulated_session, 'element.name') 

#Extracts text inside an element
html_text(extracted_element) 

#Extracts the specified attribute attached to the element/elements
html_attr(extracted_element, 'attribute.name')


#Example - extract hyperlinks, or all attributes
html_attr(extracted_element, 'href')   #html_attr() gets a single attribute

html_attrs(extracted_element)          #html_attrs() gets all attributes


See the rvest documentation, the RStudio blog post on rvest and the DataCamp tutorial on web scraping in R for more information.

Referencing HTML elements in R code


Use a combination of tags, classes and attributes as unique identifiers for the piece of information you're after; a short sketch of these selectors in action follows the list below.


html_elements(simulated_session, ?)

  • E - element:
html_elements(simulated_session, "h1")
  • E[class="bla"] – element with a given class:
html_elements(simulated_session,'h1[class="bla"]')
  • E:nth-child(n) – element E that is the nth child of its parent:
html_elements(simulated_session,'li[class="bla"]:nth-child(2)')
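A minimal sketch of these selectors in action, using a made-up list rather than the gov.uk page:

library(rvest)

#A made-up fragment with a list, to demonstrate the selector patterns above
example_page <- minimal_html('
  <ul>
    <li class="result">first item</li>
    <li class="result">second item</li>
    <li class="other">third item</li>
  </ul>')

html_elements(example_page, "li") |> html_text()                  #all <li> elements
html_elements(example_page, 'li[class="result"]') |> html_text()  #only those with class "result"
html_elements(example_page, 'li:nth-child(2)') |> html_text()     #the <li> that is the 2nd child of its parent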

Worked example: write and execute the web-scraping code 1

To pull the article dates from just the first page of news articles:

library(rvest)

#This is the gov.uk news page url, for the first page of results filtered to the COVID-19 topic only
url<-"https://www.gov.uk/search/news-and-communications?level_one_taxon=5b7b9532-a775-4bd2-a3aa-6ce380184b6c&order=updated-newest"
#simulating an html session in R 
url_session<-session(url)

##Use the selector "time" to get all <time> elements on the page (these contain the article dates)
News_times<- url_session |>
  html_elements("time")|>  #give me all <time> elements in the page source code
  html_text()  # give me the text displayed on the webpage associated with these elements
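As a quick sanity check it's worth inspecting the first few values of News_times; they should be human-readable dates such as "13 March 2020", the format assumed by the appendix code. If the <time> elements also carry a machine-readable datetime attribute (an assumption worth verifying in the page source), you could extract that with html_attr() instead of the displayed text; a hedged sketch:

library(rvest)

url<-"https://www.gov.uk/search/news-and-communications?level_one_taxon=5b7b9532-a775-4bd2-a3aa-6ce380184b6c&order=updated-newest"
url_session<-session(url)

#If a datetime attribute is present on the <time> elements, this returns machine-readable
#dates; if it isn't, html_attr() simply returns NA for each element
News_datetimes <- url_session |>
  html_elements("time") |>
  html_attr("datetime")

head(News_datetimes)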

Worked example: write and execute the web-scraping code 2

To pull the article URLs, for further web-scraping:

library(rvest)

#This is the gov.uk news page url, for the first page of results filtered to the COVID-19 topic only
url<-"https://www.gov.uk/search/news-and-communications?level_one_taxon=5b7b9532-a775-4bd2-a3aa-6ce380184b6c&order=updated-newest"
#simulating an html session in R 
url_session<-session(url)
  
##More complex example - what if we wanted the article urls to investigate themes 
## in our further analysis?
News_urls<-url_session |>
  html_element('div[class="finder-results js-finder-results"]')|> #take the <div> element with this class
  html_elements("a")|> #take all the hyperlink <a> elements in the element stated above
  html_attr('href')|>  #give me the href attributes from those hyperlinks (NOT the text displayed)
  data.frame()
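The href attributes pulled this way are often relative paths rather than full addresses; if that's the case here (an assumption to check against your output), they can be resolved against the site's base URL with url_absolute() from the xml2 package, which is installed alongside rvest. A sketch:

library(rvest)

url<-"https://www.gov.uk/search/news-and-communications?level_one_taxon=5b7b9532-a775-4bd2-a3aa-6ce380184b6c&order=updated-newest"
url_session<-session(url)

#Pull the hrefs as above, then resolve any relative paths against the site's base url
News_urls_full <- url_session |>
  html_element('div[class="finder-results js-finder-results"]') |>
  html_elements("a") |>
  html_attr('href') |>
  xml2::url_absolute("https://www.gov.uk")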

Worked example: write and execute the web-scraping code 3

To loop over the search results pages and get all the article dates:

library(rvest)

Total_number_of_pages<-3

All_dates<-c() #create an empty vector to add dates in loop

for(i in 1:Total_number_of_pages){
  
  #This url is updated with the page number i while looping
  url<-paste0("https://www.gov.uk/search/news-and-communications?level_one_taxon=5b7b9532-a775-4bd2-a3aa-6ce380184b6c&order=updated-newest&page=",i)
  #simulating an html session in R 
  url_session<-session(url)
  
  ##Use the selector "time" to get all <time> elements on the page (these contain the article dates)
  News_times<- url_session |>
    html_elements("time")|>  #give me all <time> elements in the page source code
    html_text()  # give me the text displayed on the webpage associated with these elements
  
  #Add this page's dates to the vector of all dates
  All_dates<-append(All_dates,News_times)
}
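When looping over many pages it is courteous (and reduces the chance of requests being blocked) to pause briefly between them. A minimal variation of the loop above, assuming a one-second pause is acceptable for your use case:

library(rvest)

Total_number_of_pages<-3
All_dates<-c()

for(i in 1:Total_number_of_pages){
  url<-paste0("https://www.gov.uk/search/news-and-communications?level_one_taxon=5b7b9532-a775-4bd2-a3aa-6ce380184b6c&order=updated-newest&page=",i)
  News_times<- session(url) |>
    html_elements("time")|>
    html_text()
  All_dates<-append(All_dates,News_times)
  Sys.sleep(1)  #pause for one second before requesting the next page
}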

Final thoughts

  • Always take the time to assess if a project is suitable for web-scraping (sometimes it’s not possible or you can save effort by requesting data!)
  • Take the time to understand the HTML and look for unique attributes (there may be more than one way to scrape what you need!)
  • The process is trial-and-error: always check your outputs along the way, as you may be pulling data you don't expect.

Appendix: COVID-19 articles over time

library(rvest)

Total_number_of_pages<-119

All_dates<-c() #create an empty vector to add dates in loop

for(i in 1:Total_number_of_pages){
  print(i)
  
  #This url is updated with the page number i while looping
  url<-paste0("https://www.gov.uk/search/news-and-communications?level_one_taxon=5b7b9532-a775-4bd2-a3aa-6ce380184b6c&order=updated-newest&page=",i)
  #simulating an html session in R 
  url_session<-session(url)
  
  ##Use "time" to get all time elements. First, specify a unique class for an element
  ## that contains the time elements
  News_times<- url_session |>
    html_elements("time")|>  #give me all <time> elements in the page source code
    html_text()  # give me the text displayed on the webpage associated with these elements
  
  #Add this page's dates to the vector of all dates
  All_dates<-append(All_dates,News_times)
}

All_dates<-gsub(" January ","/1/",All_dates)
All_dates<-gsub(" February ","/2/",All_dates)
All_dates<-gsub(" March ","/3/",All_dates)
All_dates<-gsub(" April ","/4/",All_dates)
All_dates<-gsub(" May ","/5/",All_dates)
All_dates<-gsub(" June ","/6/",All_dates)
All_dates<-gsub(" July ","/7/",All_dates)
All_dates<-gsub(" August ","/8/",All_dates)
All_dates<-gsub(" September ","/9/",All_dates)
All_dates<-gsub(" October ","/10/",All_dates)
All_dates<-gsub(" November ","/11/",All_dates)
All_dates<-gsub(" December ","/12/",All_dates)

All_dates<-as.Date(All_dates,"%d/%m/%Y")

hist(All_dates,breaks="months",xlab="Articles relating to COVID-19 (per month)")
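As an aside, if your R session uses an English locale (an assumption worth checking), dates in this "day month year" form can be parsed in a single step with the %B format code (full month name), which would replace the gsub() block above. A small self-contained illustration with made-up values:

example_dates <- c("13 March 2020", "1 April 2021")   #made-up values for illustration
as.Date(example_dates, format="%d %B %Y")             #"2020-03-13" "2021-04-01" in an English locale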