Using rvest with magrittr to neatly code and scrape HTML data tables from websites

Nicholas Drage

2020-04-06

For a data scientist, searching for extractable data can be a cumbersome task. Often data exists on a website with no easy way to download or extract it.

Some websites display data in an HTML <table> but offer no option to download it. In this vignette I will show you how to use html_table and html_node from the rvest package to scrape an HTML table from a website and turn it into a data frame in R. Along the way, the vignette offers some ideas on how to think about your data collection strategy and how to structure sequences of code.

The rvest package is nicely complemented by the magrittr package, which allows you to present and code arguments in a clean, elegant and easy to understand manner.

1 Install and load the rvest and magrittr packages

install.packages("rvest") #This package contains the html_node and html_table functions
install.packages("magrittr") #Easy to read piping functions

library(rvest) 
library(magrittr)

In this example we will look at Bitcoin's historical OHLCV (open, high, low, close, volume) data from coinmarketcap.com.

Note that on April 1st 2020 CoinMarketCap introduced a new coin: Toilet Paper Coin :P

Coinmarket Cap Homepage


Webpage containing a HTML table


2 Copy the XPath from within the inspect element tool

First, right-click the webpage and click Inspect. We need to find the <table> tag.

Right-click the <table> tag, click Copy, then Copy XPath.
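To see what an XPath actually selects, here is a minimal sketch that runs against a small inline HTML page instead of a live site. The page content, the name `page` and the short XPath are made up for illustration; the XPath you copy from your browser will usually be much longer.

```r
library(rvest)

# A tiny, made-up HTML page standing in for a real website (assumption)
page <- read_html('
<html>
  <body>
    <div>
      <p>Some text before the table</p>
      <table>
        <tr><th>Date</th><th>Close</th></tr>
        <tr><td>2020-03-31</td><td>100</td></tr>
        <tr><td>2020-04-01</td><td>110</td></tr>
      </table>
    </div>
  </body>
</html>')

# An absolute XPath walks down the document tree tag by tag,
# just like the one copied from the browser
table_node <- html_node(page, xpath = "/html/body/div/table")

# Check which element was matched
html_name(table_node)
```

If the XPath is correct, the matched element is the `<table>` tag itself, which is what html_table needs later on.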

If you need help finding the right tag or want to learn more about the DNA of html tables please click here.

Figure 1


3 How to use the assignment operator and an intro to piping

Now that we have the XPath copied to our clipboard let’s first understand a few basics.

It’s important to first understand what the assignment operator <- does.

url <- "https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20190401&end=20200401"

The line above assigns a value to a name: the value on the right of <- is stored under the name the operator points at, here url.

Figure 2


Now let's add some complexity.

population <- url

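The code above assigns the value held in url to a second name, population. For now population is just a copy of the URL string; once we add the scraping functions it will hold the table pulled from that address.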
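As a quick check of what these two assignments do, the following minimal sketch stores the URL under one name, assigns it to a second name, and confirms both hold the same string. Nothing is downloaded at this point.

```r
# Assign a character string to the name `url`
url <- "https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20190401&end=20200401"

# Assign the value currently held by `url` to a second name
population <- url

# Both names now hold the same character string
identical(population, url)
class(population)
```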

It’s now important to understand what the %>% pipe operator from magrittr does. In short, %>% pushes the value on its left into the function on its right, as that function's first argument.

Keep in mind it’s not only what it does, but also how it helps your code aesthetically and functionally compared to traditional nested function calls. To see some examples of %>% versus traditional function calls, see this article.

population <- url %>%    

By adding %>% we are now saying "take the value of url, push it into the function that follows the %>%, and assign the final result to population".
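The difference between nested calls and piping is easiest to see side by side. This is a small sketch using base R functions on a made-up vector `x`; the names `nested` and `piped` are only for illustration.

```r
library(magrittr)

# Made-up input values (assumption)
x <- c(4, 9, 16)

# Traditional nested calls: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# The same steps with %>%: read from top to bottom
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

# Both approaches give the same result
identical(nested, piped)
```

The piped version lists the steps in the order they happen, which is why it pairs well with a multi-step scrape.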

4 Combining arguments

Let's now look at a complete code chunk and how it pulls HTML table data from a webpage.

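url <- "https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20190401&end=20200401"
population <- url %>%
  read_html() %>%  # download and parse the page; read_html() replaces the deprecated html()
  # select the <table> node(s) using the XPath copied from the browser
  html_nodes(xpath = '/html/body/div[1]/div/div[2]/div[1]/div[2]/div[3]/div/ul[2]/li[5]/div/div/div[2]/div[3]/div/table') %>%
  html_table()     # convert each matched table to a data frame (returns a list)
population <- population[[1]]  # keep the first table from the list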

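In plain terms, the code chunk above takes the URL, reads the page's HTML, selects the <table> node at the copied XPath, converts it to a data frame with html_table, and keeps the first table from the resulting list.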
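To try the same pattern without depending on a live site, here is a minimal sketch that runs the pipeline on a small inline HTML page. The page content and the names `page`, `tables`, `prices` and `prices_direct` are made up for illustration. It also shows why the `[[1]]` step is needed: html_nodes returns every match, so html_table gives back a list, while html_node returns a single match.

```r
library(rvest)
library(magrittr)

# Made-up HTML standing in for a downloaded page (assumption)
page <- read_html('
<html><body>
  <table>
    <tr><th>Date</th><th>Open</th><th>Close</th></tr>
    <tr><td>2020-03-31</td><td>100</td><td>105</td></tr>
    <tr><td>2020-04-01</td><td>105</td><td>110</td></tr>
  </table>
</body></html>')

# html_nodes() returns every match, so html_table() gives a list of tables
tables <- page %>%
  html_nodes(xpath = "//table") %>%
  html_table()

# Take the first (and here only) table out of the list
prices <- tables[[1]]

# html_node() returns a single match, so html_table() gives the table directly
prices_direct <- page %>%
  html_node(xpath = "//table") %>%
  html_table()

# Inspect the structure of the scraped table
str(prices)
```

The relative XPath `//table` matches any table on the page, which is enough for this toy page. On a real page with several tables, a more specific XPath like the copied one is needed.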

Here is the data table we have just pulled from CoinMarketCap. Thank you.

Data Table
