Reading & writing standard formats for web data in R: HTML, XML & JSON
    This demo will focus on reading and writing web data formats. However, we will start by scraping data on several books from the MIT Press website. The demo will proceed as follows:
- Webscraping data points from three MIT Press books into an R data.frame
- Writing the R data.frame to .html, .xml & .json files
- Reading .html, .xml & .json files into R data.frames
- Comparing the data.frames from the different formats. OUR GOAL: the three data.frames should be the same.
- Conclusions
R Libraries used in this demo:
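    The library() calls below are reconstructed from the functions used throughout this demo (the original chunk isn't shown in the extracted text):
#libraries for webscraping, wrangling & format conversion
library( rvest )     #read_html(), html_nodes(), html_text(), html_table()
library( tidyverse ) #dplyr verbs, stringr's str_match_all() & the %>% pipe
library( xtable )    #xtable() to write .html
library( kulife )    #write.xml() to write .xml
library( rjson )     #toJSON() to write .json
library( RCurl )     #getURL()
library( XML )       #xmlParse(), xmlToDataFrame()
library( jsonlite )  #read_json()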
Webscraping in R
    Here we will scrape information from 3 books published by MIT Press. MIT Press is the university press for the Massachusetts Institute of Technology and publishes many books and journals at the forefront of interdisciplinary studies in science, tech, art, social studies and design.
    We will start by initializing an R data.frame to hold the data points we will be scraping…
#string pattern for MIT product pages
MIT_Stem <- 'https://mitpress.mit.edu/books/'
#titles for each book
bookTitles <- c( 'The Computational Brain', 'Visual Population Codes', 'Computational Modeling Methods' )
#string extension for each book
MIT_STRext <- c( 'computational-brain-25th-anniversary-edition', 'visual-population-codes', 'computational-modeling-methods-neuroscientists')
#initialize an R data.frame
productINFO <- data.frame( matrix( nrow=3, ncol=6 ) )
#give column names for each feature we will scrape
colnames( productINFO ) <- c( 'Title', 'Authors', 'Series', 'Summary', 'Date', 'Price')
#row names: 1 for each book we will scrape
#rownames( productINFO ) <- bookTitles
#display the result
productINFO
##   Title Authors Series Summary Date Price
## 1    NA      NA     NA      NA   NA    NA
## 2    NA      NA     NA      NA   NA    NA
## 3    NA      NA     NA      NA   NA    NA
    Now we will use a for loop to read each URL with read_html() and extract specific information patterns with html_nodes(). Specific nodes are referenced in each call to html_nodes(); the strings that specify each piece were found using the SelectorGadget extension for Chrome. For more information, please see this excellent tutorial on Scraping HTML Text. Next, we pipe the information through dplyr & tidyverse methods to clean and format the data points. Each feature is passed to the appropriate column of the R data.frame we initialized in the previous code window.
#for loop to apply the same scrape to each url
for (abook in seq(1, length(bookTitles))){
  #concatenate the full url = MIT_Stem + book extension
  url <- paste0( MIT_Stem, MIT_STRext[ abook ])
  #read_html() to read the url's html into R
  doc <- read_html( url )
  #html_nodes() to access specific html nodes; clean the text of each
  productINFO$Title[abook] <- html_nodes(doc, ".book__title") %>%
    html_text() %>% gsub("\n", "", .) %>% trimws()
  productINFO$Authors[abook] <- html_nodes(doc, ".book__authors") %>%
    html_text() %>% gsub("\n", "", .) %>% trimws()
  productINFO$Series[abook] <- html_nodes(doc, ".book__series") %>%
    html_text() %>% gsub("\n", "", .) %>% trimws()
  productINFO$Summary[abook] <- html_nodes(doc, ".book__blurb") %>%
    html_text() %>% gsub("\n", "", .) %>% trimws()
  productINFO$Date[abook] <- html_nodes(doc, "div time") %>%
    html_text() %>% gsub("\n", "", .) %>% trimws()
  productINFO$Price[abook] <- html_nodes(doc, ".pricing-item__price") %>%
    html_text() %>% gsub("\n", "", .) %>% trimws()
}
    Some further cleaning of the data is necessary….
#format Price
productINFO$Price <- productINFO$Price %>% str_match_all("^[$0-9.]+")
productINFO$Price <- as.numeric(gsub("[$]","", productINFO$Price))
#also, it seems that the 'Series' feature is redundant (all values are the same). We will remove it
productINFO <- productINFO %>% select( -Series )
Writing the R data.frame to .html, .xml & .json files
    Let’s display the R data.frame as an html table:
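    A minimal way to render the table (a sketch, assuming knitr's kable(); the original rendering call isn't shown):
#display productINFO as a table
library( knitr )
kable( productINFO )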
| | Title | Authors | Summary | Date | Price |
|---|---|---|---|---|---|
| 1 | The Computational Brain, 25th Anniversary Edition | By Patricia S. Churchland and Terrence J. Sejnowski | An anniversary edition of the classic work that influenced a generation of neuroscientists and cognitive neuroscientists. | November 2016 | 45 |
| 2 | Visual Population Codes | Edited by Nikolaus Kriegeskorte and Gabriel Kreiman | How visual content is represented in neuronal population codes and how to analyze such codes with multivariate techniques. | October 2011 | 19.75 |
| 3 | Computational Modeling Methods for Neuroscientists | Edited by Erik De Schutter | A guide to computational modeling methods in neuroscience, covering a range of modeling scales from molecular reactions to large neural networks. | September 2009 | 19.75 |
    Great! What we would like to do now is write and export the data as .html, .xml & .json output files. Each file will contain the same information; however, it will be written in one of three different formats:
write .html
    The xtable() function from the xtable library is used to format the data.frame as .html.
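    A minimal sketch of this step (the file name matches the htmlDATA.html file read back later in the demo):
#convert the data.frame to an xtable object & print it to an .html file
productHTML <- xtable( productINFO )
print( productHTML, type = 'html', file = 'htmlDATA.html' )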
write .xml
    The write.xml() function from the kulife library is used to format the data.frame as .xml.
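    A sketch of the call (the file name matches the xmlDATA.xml file read back later):
#write the data.frame directly to an .xml file
write.xml( productINFO, file = 'xmlDATA.xml' )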
write .json
    The toJSON() function from the rjson library & the base write() function are used to format the data.frame as .json.
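    A sketch of these two steps (the file name matches the jsonDATA.json file read back later):
#convert the data.frame to a json string, then write the string to file
productJSON <- toJSON( productINFO )
write( productJSON, file = 'jsonDATA.json' )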
Reading .html, .xml & .json files to R data.frames
    In the previous code blocks, we wrote & exported our R data.frame in three different formats. Next we will reverse the process and read the .html, .xml & .json files into R as data.frames. To do this, we need to access the files in each format. For the convenience of this demo, the output files have been uploaded to the author’s github account for easy access.
    Next we will access & read each file as an R data.frame…
read .html
    Here we pipe the URL for the html file to the functions read_html() and html_table() from the rvest library, then pipe to as.data.frame() to convert to an R data.frame. This process creates a feature, ‘Var.1’, holding the row numbers; we use dplyr select() to drop this column.
htmlURL <-'https://raw.githubusercontent.com/SmilodonCub/DATA607/master/htmlDATA.html'
html_df <- htmlURL %>% read_html( ) %>% html_table() %>% as.data.frame() %>% select( -Var.1)
read .xml
    getURL() from the RCurl library is used here to load the data, while xmlParse() and xmlToDataFrame() from the XML library are used to parse and convert the data to an R data.frame. We also recast the ‘Price’ feature with as.numeric().
xmlURL <- 'https://raw.githubusercontent.com/SmilodonCub/DATA607/master/xmlDATA.xml'
xml_doc <- xmlParse( getURL( xmlURL ) )
xml_df <- xmlToDataFrame( xml_doc, stringsAsFactors = F )
xml_df$Price <- as.numeric( xml_df$Price )
read .json
    read_json() from the jsonlite library is used to read the .json file into an R data.frame, and as.data.frame() recasts the result so that features are treated as strings, not factors.
jsonURL <-'https://raw.githubusercontent.com/SmilodonCub/DATA607/master/jsonDATA.json'
json_df <- as.data.frame( read_json( jsonURL, simplifyVector =TRUE), stringsAsFactors =F )Comparing data.frames from the different formats
    Wonderful! We now have three R data.frames that we derived from .html, .xml & .json source material. Technically, they should all be identical because we wrote the source files with the content of the same data.frame. But are they?
    We will use the dplyr function all_equal() to evaluate whether the data.frames are truly identical.
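    Presumably the three pairwise comparisons look like the following (a sketch; the original chunk isn't shown), producing the output below:
#pairwise comparisons: all three should return TRUE
all_equal( html_df, xml_df )
all_equal( html_df, json_df )
all_equal( xml_df, json_df )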
## [1] TRUE
## [1] TRUE
## [1] TRUE
    That’s fantastic! All three comparisons with all_equal() return ‘TRUE’; therefore, we can rest assured that the data.frames are equivalent to each other.
Conclusions
In this demo we:
- scraped information from three books from the web
- formatted several key descriptors from the webpages html to an R data.frame
- exported the data.frame into three different formats (.html, .xml & .json)
- read the files generated by our code back into R as data.frames
- demonstrated that the data.frames are equivalent
    Reading web data in different formats required subtle manipulations to ensure proper format; there was no magic-bullet function that could handle writing to a data.frame single-handedly. For example, reading .html data into an R data.frame required piping through multiple functions, and reading .xml into a data.frame required recasting numeric data. However, we were able to accomplish our goal of building three equivalent R data.frames from different formats. This was facilitated by using multiple libraries to make our life easier and our code easier to read.
About the books….
    I’m an MIT Press fangirl! They always have a booth at the annual SfN meeting (the Super Bowl of Neuroscience) & I flock over to stuff my bag with wholesale-priced inspiration. If I could, I would sing their praises from high on a mountain. The examples used in this demo are several MIT Press books of the many that I have pawed through over the years.