STDS - Scraping Data from an HTML Web Page

1. Introduction

Access to data from a variety of sources is critical to data science and analysis. Every day we come across websites that contain meaningful information for our analysis work; unfortunately, not all of them offer the public a web API for accessing that data.

In this vignette, we explore the ‘R’ way to extract information from web pages. The steps show how to use ‘rvest’ together with the “SelectorGadget” tool to read through HTML code and output the data in a readable format for further analysis.

In this particular demo, we will scrape some data from the Reserve Bank of Australia (RBA) website.

This is a relatively simple web page that publishes the latest foreign exchange rates against the Australian dollar, which makes it a useful place to get a feel for ‘rvest’ and “SelectorGadget”. Let’s see how web scraping works in R.

2. Web HTML structure

HTML has a nested structure built from tags. The sample below, from the W3Schools website, shows the basic layout:

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>This is a Heading</h1>
<p>This is a paragraph.</p>

</body>
</html>

Most of the time, we are only interested in the ‘body’ of the HTML, which is where most of the data we want to extract lives.

The ‘rvest’ library is one of the tools that makes web scraping possible in R. If you have ever worked with the ‘beautifulsoup’ library in Python, the two are functionally very similar: both use HTML tags to find the right nested information.
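As a minimal sketch of how tags map to data, the snippet below parses the W3Schools sample above from a string (so no network access is needed) and pulls out the heading and the paragraph by tag name. The variable names are illustrative only.

```r
library(rvest)

# The sample page from above, held in a string so it can be parsed offline
sample_page <- '<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
</html>'

# read_html() accepts literal HTML as well as a URL or file path
doc <- read_html(sample_page)

# Select nodes by tag name, then extract their text
heading <- html_text(html_nodes(doc, "h1"))
paragraph <- html_text(html_nodes(doc, "p"))

heading
paragraph
```

The same two-step pattern, select nodes and then extract text, is what the rest of this vignette applies to the RBA page.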

3. Discover data with SelectorGadget

SelectorGadget is an open source tool created by Andrew Cantino and Kyle Maxwell. In a nutshell, it lets us locate data on complicated websites without reading through the CSS in the page source. For folks who don’t have much web development knowledge, it is a handy tool to use alongside ‘rvest’.

4. Install SelectorGadget

To install SelectorGadget, open your Chrome browser and navigate to this SelectorGadget link. After installation, SelectorGadget appears in the top right corner of your Chrome browser like any other extension you might have, as shown in the image below.

SelectorGadget on Chrome Extension

OK, the installation is complete.

Let’s load the ‘rvest’ library and get our hands dirty.

library(rvest)

## Loading required package: xml2

5. Loading HTML code from the RBA website

Here we use read_html() to read the contents of the page and assign the result to a variable, ‘html’, so that we can harvest the data later by referencing CSS selectors.

url <- 'https://www.rba.gov.au/statistics/frequency/exchange-rates.html'
html <- read_html(url)
html

## {html_document}
## <html lang="en-au" xml:lang="en-au" xmlns="http://www.w3.org/1999/xhtml" class="no-js">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\r\n<div class="page" id="page-exchange-rates">\r\n\r\n\t\r\n\t<a i ...
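A page request can fail (no connection, a moved URL), so it can help to guard the call. The sketch below wraps read_html() in tryCatch(); safe_read_html() is a hypothetical helper written for this example and is not part of ‘rvest’.

```r
library(rvest)

# Hypothetical helper (not part of rvest): returns the parsed page,
# or NULL with a message if the page cannot be read.
safe_read_html <- function(url) {
  tryCatch(
    read_html(url),
    error = function(e) {
      message("Could not read ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

url <- 'https://www.rba.gov.au/statistics/frequency/exchange-rates.html'
html <- safe_read_html(url)

# Only continue with the extraction if the page was actually loaded
if (!is.null(html)) {
  print(html)
}
```

Returning NULL rather than stopping lets a longer script decide for itself whether to retry, skip, or abort.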

6. SelectorGadget in Action

With the HTML captured, we will use SelectorGadget to identify the area of the web page that holds the data we want.

1. First, open your Chrome browser and go to the RBA website.

2. Click on the SelectorGadget browser extension in Chrome. The tool is now activated and ready to use.

3. Lastly, click on the element you want to extract. The elements that match the selector will be highlighted in yellow.

In this case, clicking on the currency column populates the selector box at the bottom of the page with ‘th’ and highlights all the currency names in yellow. (The selection also includes three column names for dates, among others; don’t worry, we will clean the data later.)
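To make the idea of a selector concrete, here is a small sketch on a made-up table. The markup is hypothetical and much simpler than the real RBA page, whose structure may differ; it only illustrates how a tag selector such as ‘th’ and a class selector such as ‘.highlight’ pick out different cells.

```r
library(rvest)

# Hypothetical miniature table; the real RBA markup may differ
mini_table <- '<table>
  <tr><th>01 Apr 2020</th><th>02 Apr 2020</th></tr>
  <tr><th>United States dollar</th><td>0.6120</td><td class="highlight">0.6081</td></tr>
  <tr><th>Japanese yen</th><td>65.80</td><td class="highlight">65.36</td></tr>
</table>'

mini <- read_html(mini_table)

# Tag selector: every <th> cell, date headers and row labels alike
header_cells <- html_text(html_nodes(mini, "th"))

# Class selector: only the cells carrying class="highlight"
highlighted <- html_text(html_nodes(mini, ".highlight"))

header_cells
highlighted
```

Because ‘th’ matches every header cell, the result mixes dates with currency names, which is why a clean-up step is needed after the extraction.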

Getting Currency Names

Here, we pass the ‘th’ selector to the html_nodes() function.

ex_currency <- html_nodes(html, 'th')
ex_currency <- html_text(ex_currency)
ex_currency

##  [1] "31 Mar 2020"                 "01 Apr 2020"                
##  [3] "02 Apr 2020"                 "United States dollar"       
##  [5] "Chinese renminbi"            "Japanese yen"               
##  [7] "European euro"               "South Korean won"           
##  [9] "Singapore dollar"            "New Zealand dollar"         
## [11] "UK pound sterling"           "Malaysian ringgit"          
## [13] "Thai baht"                   "Indonesian rupiah"          
## [15] "Indian rupee"                "New Taiwan dollar"          
## [17] "Vietnamese dong"             "Hong Kong dollar"           
## [19] "Papua New Guinea kina"       "Swiss franc"                
## [21] "United Arab Emirates dirham" "Canadian dollar"            
## [23] "Trade-weighted Index (4pm)"  "Special Drawing Right"

As the result above shows, the selection also includes irrelevant header names that we need to trim. To do this, we use the index range 4:22 to keep only the currency names.

currency <- ex_currency[4:22]
currency

##  [1] "United States dollar"        "Chinese renminbi"           
##  [3] "Japanese yen"                "European euro"              
##  [5] "South Korean won"            "Singapore dollar"           
##  [7] "New Zealand dollar"          "UK pound sterling"          
##  [9] "Malaysian ringgit"           "Thai baht"                  
## [11] "Indonesian rupiah"           "Indian rupee"               
## [13] "New Taiwan dollar"           "Vietnamese dong"            
## [15] "Hong Kong dollar"            "Papua New Guinea kina"      
## [17] "Swiss franc"                 "United Arab Emirates dirham"
## [19] "Canadian dollar"
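Fixed positions such as 4:22 depend on the page layout on the day of scraping. As a hedged alternative, the sketch below filters by content instead: it drops anything that looks like a date and anything in a short exclusion list. The input is a shortened copy of the output above, and the exclusion list is an assumption based on the two trailing entries that the index range leaves out.

```r
# Shortened copy of the scraped header text shown above
ex_currency <- c(
  "31 Mar 2020", "01 Apr 2020", "02 Apr 2020",
  "United States dollar", "Chinese renminbi", "Canadian dollar",
  "Trade-weighted Index (4pm)", "Special Drawing Right"
)

# Entries shaped like "31 Mar 2020" are date column headers
is_date <- grepl("^[0-9]{2} [A-Za-z]{3} [0-9]{4}$", ex_currency)

# Assumption: these two labels are the non-currency rows to exclude
non_currency <- c("Trade-weighted Index (4pm)", "Special Drawing Right")

currency <- ex_currency[!is_date & !(ex_currency %in% non_currency)]
currency
```

This is not more correct than indexing, only less sensitive to rows being added or removed on the page.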

Now we have exactly the currency names we need.

7. Exchange rate (repeating the steps above)

Getting Latest Currency Rate

Just as we did with the currency name column, let’s work on the currency rate column using SelectorGadget. This time, we pass the “.highlight” selector to the html_nodes() function. Again, some irrelevant fields will need to be trimmed afterwards.

ex_rate_latest <- html_nodes(html, '.highlight')
ex_rate_latest <- html_text(ex_rate_latest)
ex_rate_latest

##  [1] "0.6081" "4.3190" "65.36"  "0.5556" "750.62" "0.8711" "1.0249" "0.4908"
##  [9] "2.6623" "20.12"  "10061"  "45.94"  "18.41"  "14348"  "4.7145" "2.0825"
## [17] "0.5883" "2.2332" "0.8615" "54.0"   "0.4452"

Again, we use an index range to trim the data and keep only the currency rates.

rate <- ex_rate_latest[1:19]
rate

##  [1] "0.6081" "4.3190" "65.36"  "0.5556" "750.62" "0.8711" "1.0249" "0.4908"
##  [9] "2.6623" "20.12"  "10061"  "45.94"  "18.41"  "14348"  "4.7145" "2.0825"
## [17] "0.5883" "2.2332" "0.8615"

Now, let’s combine the two extractions into a data frame using the data.frame() function, giving us a complete list of foreign exchange rates alongside their currency names.

df_foreign_ex <- data.frame(currency, rate)
df_foreign_ex

##                       currency   rate
## 1         United States dollar 0.6081
## 2             Chinese renminbi 4.3190
## 3                 Japanese yen  65.36
## 4                European euro 0.5556
## 5             South Korean won 750.62
## 6             Singapore dollar 0.8711
## 7           New Zealand dollar 1.0249
## 8            UK pound sterling 0.4908
## 9            Malaysian ringgit 2.6623
## 10                   Thai baht  20.12
## 11           Indonesian rupiah  10061
## 12                Indian rupee  45.94
## 13           New Taiwan dollar  18.41
## 14             Vietnamese dong  14348
## 15            Hong Kong dollar 4.7145
## 16       Papua New Guinea kina 2.0825
## 17                 Swiss franc 0.5883
## 18 United Arab Emirates dirham 2.2332
## 19             Canadian dollar 0.8615
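Note that html_text() returns character strings, so the rate column above holds text rather than numbers. If the rates are going to be used in calculations, one option is to convert them when building the data frame. The sketch below uses a three-row sample of the values above.

```r
# Small sample of the scraped values; html_text() yields character vectors
currency <- c("United States dollar", "Japanese yen", "Indonesian rupiah")
rate <- c("0.6081", "65.36", "10061")

# Guard against a mismatch between the two trimmed vectors
stopifnot(length(currency) == length(rate))

# Convert the rates to numeric so they can be used in arithmetic
df_foreign_ex <- data.frame(
  currency,
  rate = as.numeric(rate),
  stringsAsFactors = FALSE
)

str(df_foreign_ex)
```

as.numeric() yields NA (with a warning) for any value that is not a plain number, which is a useful signal that the trimming step let a stray field through.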

Finally, we use the write.csv() function to export the data frame as a CSV file. Sometimes we come across data that we would like to reuse in future analysis, so let’s keep it on the hard drive in case R loses its memory.

write.csv(df_foreign_ex, file='stds_exchange_rate.csv', row.names = FALSE)

After all this, we have a CSV file saved in the working directory, where the project files are stored.
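A quick way to confirm the export worked is to read the file back in. The sketch below does a round trip on a two-row sample; it writes to a temporary directory so it does not touch the project files, which is a choice made for this example only.

```r
# Two-row sample standing in for the full data frame
df_sample <- data.frame(
  currency = c("United States dollar", "Japanese yen"),
  rate = c(0.6081, 65.36),
  stringsAsFactors = FALSE
)

# Write to a temporary location (assumption: the real file lives in the working directory)
out_file <- file.path(tempdir(), "stds_exchange_rate.csv")
write.csv(df_sample, file = out_file, row.names = FALSE)

# Read the file back and inspect it
df_check <- read.csv(out_file, stringsAsFactors = FALSE)
df_check
```

With row.names = FALSE, the file contains only the two data columns, so it reads back with the same column names it was written with.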

Additional Resources

Tidy web scraping in R — Tutorial and resources by Keith McNulty
SelectorGadget: point and click CSS selectors
An introduction to web scraping using R by Hiren Patel
W3schools sample html
Markdown quick reference cheat sheet