For a data scientist, searching for extractable data can be a cumbersome task. Data often sits on a website that offers no way to download or extract it easily.
Some websites display data in HTML <table> format but give no option to download the table. In this vignette I will show you how to use html_nodes and html_table from the rvest package to scrape an HTML table from a website and transform it into a data table in R. The vignette also offers some ideas on how to think about your data collection strategy and about optimizing and structuring sequences of code.
The rvest package is nicely complemented by the magrittr package, which allows you to present and code arguments in a clean, elegant and easy to understand manner.
rvest and magrittr package
install.packages("rvest") #This package contains the html_node and html_table functions
install.packages("magrittr") #Easy to read piping functions
library(rvest)
library(magrittr)
In this example we will look at Bitcoin's historical OHLCV (open, high, low, close, volume) data from coinmarketcap.com.
Note that on April 1st 2020 CoinMarketCap introduced a new coin: Toilet Paper Coin :P
Coinmarket Cap Homepage
Webpage containing an HTML table
First, right-click the webpage and click Inspect. We need to find the <table> tag.
Right-click the <table> tag, click Copy, then click Copy XPath.
If you need help finding the right tag or want to learn more about the DNA of html tables please click here.
Figure 1
Now that we have the XPath copied to our clipboard let’s first understand a few basics.
It’s important to first understand what the assignment operator <- does.
url <- "https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20190401&end=20200401"
The above assigns a value to a name. The name sits on the left, where the operator <- points, and the value sits on the right. Here the URL string is the value and url is the name.
Figure 2
Now let's add some complexity.
population <- url
The code above assigns the value currently held by url to a second name, population. Both names now refer to the same URL string.
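As a minimal sketch of how assignment behaves, the snippet below uses a placeholder address (example.com, not part of the original example) to show that assigning one name to another copies the current value, so later changes to url do not affect population.

```r
# Assign a string to the name `url` (example.com is a placeholder address).
url <- "https://example.com/data"

# Assign the value currently held by `url` to a second name.
population <- url

# Both names hold the same string at this point.
identical(population, url)

# Reassigning `url` afterwards leaves `population` unchanged.
url <- "https://example.com/other"
population
```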
It’s now important to understand what the %>% pipe operator from magrittr does. In short, %>% takes the value on its left and passes it as the first argument to the function on its right.
Keep in mind it’s not only what it does, but also how it helps your code aesthetically and functionally compared to traditional nested function calls. To see some examples of %>% vs traditional function calls see this article.
population <- url %>%
By adding %>% we are now saying: take the value of url, push it into the function that follows the %>%, and assign the final result to population.
Let's now look at a complete code chunk and how it will pull HTML table data from a webpage.
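To make the comparison with traditional function calls concrete, here is a small sketch using only base R functions and magrittr. The values vector is made up for illustration; the nested and piped versions perform the same three steps.

```r
library(magrittr)

# Made-up numbers used only for illustration.
values <- c(4, 9, 16)

# Traditional nested call: read from the inside out.
nested <- round(mean(sqrt(values)), 1)

# The same steps with %>%: read from top to bottom.
piped <- values %>%
  sqrt() %>%
  mean() %>%
  round(1)

# Both approaches give the same result.
identical(nested, piped)
```

The piped version lists the steps in the order they run, which is what makes longer scraping pipelines easier to follow.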
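url <- "https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20190401&end=20200401"
population <- url %>%
  read_html() %>% # html() is deprecated in current rvest; read_html() is its replacement
  html_nodes(xpath='/html/body/div[1]/div/div[2]/div[1]/div[2]/div[3]/div/ul[2]/li[5]/div/div/div[2]/div[3]/div/table') %>%
  html_table() # returns a list with one data frame per matched table
population <- population[[1]] # keep the first (and only) table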
To understand what the above code chunk achieves, it roughly translates to:
Take the url value and pipe it into the html function (read_html in current versions of rvest). The function downloads the page at the address stored in url in the first line of code and parses its HTML. The final result of the whole pipeline is what gets assigned to population.
Since %>% comes after html, the parsed page is pushed into the html_nodes function. The html_nodes function acts as a filter that only passes on the nodes matching the XPath supplied in its argument, here the <table> tag.
With the %>% after the XPath argument, the matched nodes are pushed into the html_table function, which parses each HTML table into a data frame. Because html_nodes can match more than one node, the result is a list of data frames, one per matched table.
population <- population[[1]] The subscript [[1]] extracts the first element of that list, so population becomes the data frame for the single table we matched rather than a list containing it. The first column in this example is the date.
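The same sequence can be tried without a network connection. The sketch below is an illustration under stated assumptions, not the original workflow: page_source is a hypothetical, hand-written HTML string with made-up values, the short XPath //table stands in for the long copied one, and read_html is used in place of html.

```r
library(rvest)
library(magrittr)

# Hypothetical page content with made-up values, standing in for a real website.
page_source <- "<html><body>
  <table>
    <tr><th>Date</th><th>Close</th></tr>
    <tr><td>Day 1</td><td>100</td></tr>
    <tr><td>Day 2</td><td>110</td></tr>
  </table>
</body></html>"

tables <- page_source %>%
  read_html() %>%                   # parse the HTML string
  html_nodes(xpath = "//table") %>% # keep only <table> nodes
  html_table()                      # one data frame per matched table

# html_table() on a set of nodes returns a list, so pick the first table.
prices <- tables[[1]]

names(prices)
nrow(prices)
```

Checking length(tables) before subsetting is a cheap way to notice when a page layout change means the XPath no longer matches anything.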
Here is the data table we have just pulled. Thank you
Data Table