Alphabet Soup: HTML, XML & JSON in R

Bonnie Cooper




Reading & writing standard formats for web data in R: HTML, XML & JSON



Table of Contents

    This demo will focus on reading and writing web data formats. However, we will start by scraping data on several books from the MIT Press website. The demo will proceed as follows:

  • Webscraping datapoints from 3 MIT Press books to an R data.frame
  • Writing the R data.frame to .html, .xml & .json files
  • Reading .html, .xml & .json files to R data.frames
  • Comparing data.frames from the different formats. OUR GOAL: the three data.frames should be the same.
  • Conclusions

Webscraping in R

    Here we will scrape information from 3 books published by MIT Press. MIT Press is the university press for the Massachusetts Institute of Technology and publishes many books and journals at the forefront of interdisciplinary studies in science, tech, art, social studies and design.

    We will start by initializing an R data.frame to hold the data points we will be scraping…

##   Title Authors Series Summary Date Price
## 1    NA      NA     NA      NA   NA    NA
## 2    NA      NA     NA      NA   NA    NA
## 3    NA      NA     NA      NA   NA    NA
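
    The empty frame printed above can be set up in a few lines of base R. This is a minimal sketch rather than the original code: the column names are copied from the output, while the object names books and n_books are assumptions made for illustration.

```r
# Number of books to scrape (taken from the demo description)
n_books <- 3

# Column names copied from the printed output above
cols <- c("Title", "Authors", "Series", "Summary", "Date", "Price")

# Build an all-NA data.frame with one row per book
books <- data.frame(matrix(NA, nrow = n_books, ncol = length(cols)))
names(books) <- cols

print(books)
```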


    Now we will use a for loop to read each URL with read_html() and extract specific pieces of each page with html_nodes(). Each call to html_nodes() references specific nodes; the selector strings that identify each piece were found using the SelectorGadget extension for Chrome. For more information, please see this excellent tutorial on Scraping HTML Text. Next, we pipe the results through dplyr and other tidyverse functions to clean and format the data points. Each feature is then assigned to the appropriate column of the R data.frame we initialized in the previous code window.
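
    The loop described here might look roughly like the sketch below. To keep it runnable offline, it parses two small inline HTML strings instead of the live book pages. The markup, the class names .book-title and .book-price, the "$" price format, and the object names pages and scraped are all invented for illustration; real selectors would come from SelectorGadget.

```r
library(rvest)  # provides read_html(), html_nodes(), html_text() and the %>% pipe

# Hypothetical stand-ins for the book pages (inline HTML, not the real site markup)
pages <- c(
  '<html><body><h1 class="book-title"> Visual Population Codes </h1>
   <p class="book-price">$19.75</p></body></html>',
  '<html><body><h1 class="book-title"> Computational Modeling Methods for Neuroscientists </h1>
   <p class="book-price">$19.75</p></body></html>'
)

# One row per page, filled in by the loop
scraped <- data.frame(Title = rep(NA_character_, length(pages)),
                      Price = rep(NA_character_, length(pages)),
                      stringsAsFactors = FALSE)

for (i in seq_along(pages)) {
  page <- read_html(pages[i])  # a URL string would be passed the same way

  scraped$Title[i] <- page %>%
    html_nodes(".book-title") %>%   # assumed selector
    html_text(trim = TRUE)

  scraped$Price[i] <- page %>%
    html_nodes(".book-price") %>%   # assumed selector
    html_text(trim = TRUE)
}

print(scraped)
```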


    Some further cleaning of the data is necessary….
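
    The original cleaning code is not shown, so the following is only a guess at the kind of steps involved. It assumes the scraped Price arrives as text with a currency symbol, that titles may carry stray whitespace, and that the empty Series column is dropped (it does not appear in the table further down). The toy input frame raw is invented for illustration.

```r
# Toy frame standing in for freshly scraped data; the raw strings are invented
raw <- data.frame(
  Title  = c(" Visual Population Codes ",
             "Computational Modeling Methods for Neuroscientists"),
  Series = c(NA, NA),
  Price  = c("$19.75", "$19.75"),
  stringsAsFactors = FALSE
)

clean <- raw
clean$Series <- NULL                                          # drop the unused column
clean$Title  <- trimws(clean$Title)                           # strip stray whitespace
clean$Price  <- as.numeric(gsub("[^0-9.]", "", clean$Price))  # keep digits and the decimal point

str(clean)
```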



Writing the R data.frame to .html, .xml & .json files

    Let’s display the R data.frame as an html table:

  | Title | Authors | Summary | Date | Price
1 | The Computational Brain, 25th Anniversary Edition | By Patricia S. Churchland and Terrence J. Sejnowski | An anniversary edition of the classic work that influenced a generation of neuroscientists and cognitive neuroscientists. | November 2016 | 45
2 | Visual Population Codes | Edited by Nikolaus Kriegeskorte and Gabriel Kreiman | How visual content is represented in neuronal population codes and how to analyze such codes with multivariate techniques. | October 2011 | 19.75
3 | Computational Modeling Methods for Neuroscientists | Edited by Erik De Schutter | A guide to computational modeling methods in neuroscience, covering a range of modeling scales from molecular reactions to large neural networks. | September 2009 | 19.75


    Great! What we would like to do now is export the data as .html, .xml & .json output files. Each file will contain the same information, but it will be written in one of three different formats:

write .html

    The xtable() function from the xtable library is used to format the data.frame as an .html table.
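
    A minimal sketch of this step is shown below. xtable() builds the table object, and print() with type = "html" writes it out. The two-row books frame is a cut-down stand-in for the scraped data, and the temporary output path is an assumption; the original file name is not given.

```r
library(xtable)

# Cut-down stand-in for the scraped data.frame
books <- data.frame(
  Title = c("Visual Population Codes",
            "Computational Modeling Methods for Neuroscientists"),
  Date  = c("October 2011", "September 2009"),
  Price = c(19.75, 19.75),
  stringsAsFactors = FALSE
)

# Illustrative output location
html_path <- tempfile(fileext = ".html")

# Convert to an xtable object and write it as an HTML table
print(xtable(books), type = "html", file = html_path)

# Read the file back as text to confirm what was written
html_lines <- readLines(html_path)
```

    Note that xtable writes the row names as an extra first column with an empty header, which matters when the file is read back in.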


write .xml

    The write.xml() function from the kulife library is used to write the data.frame in .xml format.
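
    In case kulife is not available, the same kind of file can be produced with the XML package directly. The sketch below is a swapped-in alternative, not the write.xml() call used in the demo. The element names document and row are assumptions chosen so that xmlToDataFrame() can read the result back, and the books frame is a cut-down stand-in.

```r
library(XML)

# Cut-down stand-in for the scraped data.frame
books <- data.frame(
  Title = c("Visual Population Codes",
            "Computational Modeling Methods for Neuroscientists"),
  Price = c(19.75, 19.75),
  stringsAsFactors = FALSE
)

# One <row> element per book, one child element per column
root <- newXMLNode("document")
for (i in seq_len(nrow(books))) {
  row_node <- newXMLNode("row", parent = root)
  for (col in names(books)) {
    newXMLNode(col, as.character(books[i, col]), parent = row_node)
  }
}

# Serialize the tree to text and write it to an illustrative temp file
xml_path <- tempfile(fileext = ".xml")
writeLines(saveXML(root), xml_path)
```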


write .json

    toJSON() from the rjson library and the base write() function are used to convert the data.frame to .json and save it to a file.
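
    A sketch of the two calls working together is given below. The books frame is a cut-down stand-in and the temporary output path is an assumption. Because a data.frame is a list of columns, rjson serializes it as one JSON array per column.

```r
library(rjson)

# Cut-down stand-in for the scraped data.frame
books <- data.frame(
  Title = c("Visual Population Codes",
            "Computational Modeling Methods for Neuroscientists"),
  Price = c(19.75, 19.75),
  stringsAsFactors = FALSE
)

# Illustrative output location
json_path <- tempfile(fileext = ".json")

# toJSON() returns a single JSON string; write() saves it to disk
json_text <- toJSON(books)
write(json_text, file = json_path)
```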



Reading .html, .xml & .json files to R data.frames

    In the previous code blocks, we exported our R data.frame in three different formats. Next we will reverse the process and read the .html, .xml & .json files into R as data.frames. To do this, we need access to the files. For the convenience of this demo, the output files have been uploaded to the author’s GitHub account.

    Next we will access & read each file as an R data.frame…

read .html

    Here we pipe the URL for the html file to the functions read_html() and html_table() from the rvest library and then pipe to as.data.frame() to convert to an R data.frame. This process creates a feature, ‘Var.1’, holding the row numbers. We use dplyr select() to drop this column.
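
    The pipeline described here can be sketched offline by parsing an inline HTML table in place of the hosted file. The table text below imitates the layout xtable produces (an empty header cell above the row numbers) but is written by hand for illustration. The sketch drops the row-number column by position rather than by the name ‘Var.1’, an assumption made so it does not depend on how the empty header gets named.

```r
library(rvest)
library(dplyr)

# Hand-written stand-in for the hosted .html file
table_html <- '<table border=1>
<tr> <th>  </th> <th> Title </th> <th> Price </th> </tr>
<tr> <td> 1 </td> <td> Visual Population Codes </td> <td> 19.75 </td> </tr>
<tr> <td> 2 </td> <td> Computational Modeling Methods for Neuroscientists </td> <td> 19.75 </td> </tr>
</table>'

books_html <- read_html(table_html) %>%        # a URL string would be passed the same way
  html_table() %>%                             # list with one table
  as.data.frame(stringsAsFactors = FALSE) %>%  # flatten to a plain data.frame
  select(-1)                                   # drop the row-number column

str(books_html)
```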


read .xml

    getURL() from the RCurl library is used here to load the data, while xmlParse() and xmlToDataFrame() from the XML library parse it and convert it to an R data.frame. Because xmlToDataFrame() reads every value as text, we also recast the ‘Price’ feature with as.numeric().
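
    A sketch of the parsing and recasting steps follows. To stay runnable offline it parses an inline XML string; with the hosted file, the text would come from getURL() instead. The XML layout and element names are assumptions written by hand for illustration.

```r
library(XML)

# Hand-written stand-in for the text that getURL() would return
xml_text <- '<document>
  <row><Title>Visual Population Codes</Title><Price>19.75</Price></row>
  <row><Title>Computational Modeling Methods for Neuroscientists</Title><Price>19.75</Price></row>
</document>'

# Parse the text, then turn each <row> into one data.frame row
doc <- xmlParse(xml_text, asText = TRUE)
books_xml <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Every column arrives as character, so Price must be recast
books_xml$Price <- as.numeric(books_xml$Price)

str(books_xml)
```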


read .json

    read_json() from the jsonlite library is used to read the .json file into R, and as.data.frame() converts the result to an R data.frame whose text features are strings rather than factors.
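
    The sketch below shows one way these two calls can fit together. It first writes a small column-oriented JSON file to a temporary path so that it runs offline; the file contents and path are stand-ins for the hosted file. Setting simplifyVector = TRUE is an assumption about the original call; it makes each JSON array come back as an ordinary vector.

```r
library(jsonlite)

# Hand-written stand-in for the hosted .json file
json_path <- tempfile(fileext = ".json")
writeLines(
  '{"Title":["Visual Population Codes","Computational Modeling Methods for Neuroscientists"],"Price":[19.75,19.75]}',
  json_path
)

# Read the file into a named list of vectors, one per column
books_list <- read_json(json_path, simplifyVector = TRUE)

# Convert to a data.frame, keeping text columns as strings
books_json <- as.data.frame(books_list, stringsAsFactors = FALSE)

str(books_json)
```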




Comparing data.frames from the different formats

    Wonderful! We now have three R data.frames derived from .html, .xml & .json source material. In principle, they should all be identical because we wrote the source files from the same data.frame. But are they?

    We will use the dplyr function all_equal() to evaluate whether the data.frames are truly identical.

## [1] TRUE
## [1] TRUE
## [1] TRUE
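
    The pairwise check can also be sketched with base R alone. This example swaps in all.equal() from base R for dplyr's all_equal(), so it is a related technique rather than the call used above. The three small frames are stand-ins built by hand, with the XML version deliberately passing through as.numeric() as in the read step.

```r
# Stand-ins for the three frames read back from .html, .xml and .json
titles <- c("Visual Population Codes",
            "Computational Modeling Methods for Neuroscientists")

books_html <- data.frame(Title = titles, Price = c(19.75, 19.75),
                         stringsAsFactors = FALSE)
books_xml  <- data.frame(Title = titles, Price = as.numeric(c("19.75", "19.75")),
                         stringsAsFactors = FALSE)
books_json <- data.frame(Title = titles, Price = c(19.75, 19.75),
                         stringsAsFactors = FALSE)

# all.equal() returns TRUE or a description of the differences,
# so wrap it in isTRUE() to get a plain logical
same <- c(
  html_vs_xml  = isTRUE(all.equal(books_html, books_xml)),
  html_vs_json = isTRUE(all.equal(books_html, books_json)),
  xml_vs_json  = isTRUE(all.equal(books_xml,  books_json))
)

print(same)
```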


    That’s fantastic! All three comparisons with all_equal() return ‘TRUE’, so we can rest assured that the data.frames are equivalent to each other.



Conclusions

In this demo we:

  • scraped information from three books from the web
  • formatted several key descriptors from the webpages’ html into an R data.frame
  • exported the data.frame into three different formats (.html, .xml & .json)
  • read the files generated by our code back into R as data.frames
  • demonstrated that the data.frames are equivalent

    Reading web data in different formats required subtle manipulations to ensure a proper result; no single function could turn a file into a data.frame on its own. For example, reading .html data into an R data.frame required piping through multiple functions, and reading .xml into a data.frame required recasting numeric data. However, we were able to accomplish our goal of building three equivalent R data.frames from different formats. This was facilitated by using multiple libraries to make our life easier and our code easier to read.



About the books….

    I’m an MIT Press fangirl! They always have a booth at the annual SfN meeting (the Super Bowl of neuroscience) & I flock over to stuff my bag with wholesale-priced inspiration. If I could, I would sing their praises from high on a mountain. The examples used in this demo are a few of the many MIT Press books that I have pawed through over the years.