IS 607 - Assignment 7 - Working with XML and JSON in R

By Md Forhad Akbar & Shovan Biswas

2019-10-09


R Markdown output

We are using the prettydoc package with the tactile theme for our R Markdown output this week. prettydoc is well documented at https://prettydoc.statr.me/index.html

Problem Statement

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

GitHub

The HTML, XML, JSON and .Rmd files used in this assignment can be found at https://github.com/ShovanBiswas/DATA607/tree/master/Week7 and https://github.com/forhadakbar/data607fall2019/tree/master/Week%2007

Section 1: HTML

Read HTML

## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n\t\t<table id="Table1" class="regular" border="1" cellspacin ...

Display class of BooksHTML

## [1] "xml_document" "xml_node"
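
As a minimal sketch of the read step, the snippet below parses a small inline HTML string with xml2 and checks the class of the result. The inline string is a hypothetical stand-in for books.html, and the use of xml2's read_html() is our assumption based on the classes printed above; the original code chunk is not shown.

```r
library(xml2)

# Hypothetical stand-in for books.html (assumption: the real file is larger)
html_text <- '<html><body>
<table id="Table1">
  <tr><th>Title</th><th>Authors</th></tr>
  <tr><td>Advanced R</td><td>Hadley Wikham</td></tr>
</table>
</body></html>'

# read_html() accepts a file path, a URL, or literal markup
BooksHTML <- read_html(html_text)

# The parsed document is a node tree, not a data frame
class(BooksHTML)
```

For the real assignment, the string would be replaced by the path or URL of the HTML file.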

Extract table nodes

R does not read an HTML file directly into a data frame; it stores the parsed file as a tree of XML/HTML nodes, so we extract the nodes of interest into R variables. In this file the book data sits in a table node (please refer to the original HTML file at the links above), and a second table is nested inside it to hold the authors of one book. Below we extract the table nodes; the result contains both the main table and the nested one.

## {xml_nodeset (2)}
## [1] <table id="Table1" class="regular" border="1" cellspacing="0" cellpa ...
## [2] <table id="Table2">\n<tr>\n<td>David M. Diez</td>\n\t\t\t\t\t\t</tr> ...
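
The node extraction can be sketched with an XPath query. This is only an illustration on a hypothetical inline fragment that mimics the nesting (Table2 inside a cell of Table1); the variable names doc and table_nodes are ours, and the original code may have used a different selector function.

```r
library(xml2)

# Hypothetical fragment: Table2 is nested inside a cell of Table1
html_text <- '<table id="Table1">
  <tr><td>OpenIntro Statistics</td>
      <td><table id="Table2"><tr><td>David M. Diez</td></tr></table></td></tr>
</table>'

doc <- read_html(html_text)

# "//table" matches every table element, at any depth, in document order
table_nodes <- xml_find_all(doc, "//table")

length(table_nodes)          # number of table nodes found
xml_attr(table_nodes, "id")  # ids of the matched tables
```

Because the query matches at any depth, the nested table comes back as a second node alongside the main one.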

Convert the table nodes to tables

Next, we convert each table node to a table. Because of the nested table, the data in the first table comes out mangled and is not yet usable as a data frame; we handle that in the following steps. This step only shows the two extracted tables.

## [[1]]
##                               Title
## 1 The Complete Reference HTML & CSS
## 2              OpenIntro Statistics
## 3                     David M. Diez
## 4             Mine Cetinkaya-Rundel
## 5                Christopher D Barr
## 6                        Advanced R
##                                                                Authors
## 1                                                     Thomas A. Powell
## 2 David M. Diez\n\t\t\t\t\t\tMine Cetinkaya-Rundel\n\t\t\t\t\t\tChristopher D Barr
## 3                                                                 <NA>
## 4                                                                 <NA>
## 5                                                                 <NA>
## 6                                                        Hadley Wikham
##         Subject                 Genre                 NA         NA
## 1   HTML Coding            Technology               <NA>       <NA>
## 2 David M. Diez Mine Cetinkaya-Rundel Christopher D Barr Statistics
## 3          <NA>                  <NA>               <NA>       <NA>
## 4          <NA>                  <NA>               <NA>       <NA>
## 5          <NA>                  <NA>               <NA>       <NA>
## 6             R            Technology               <NA>       <NA>
##          NA
## 1      <NA>
## 2 Math/Stat
## 3      <NA>
## 4      <NA>
## 5      <NA>
## 6      <NA>
## 
## [[2]]
##                      X1
## 1         David M. Diez
## 2 Mine Cetinkaya-Rundel
## 3    Christopher D Barr
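
A hedged sketch of this conversion with rvest follows. The inline fragment and the names BooksNodes and BooksTables are hypothetical, and the exact layout of the mangled first table can differ between rvest versions, so no expected output is shown for it.

```r
library(rvest)

# Hypothetical fragment with a nested author table (no header row, for brevity)
html_text <- '<table id="Table1">
  <tr><td>OpenIntro Statistics</td>
      <td><table id="Table2">
        <tr><td>David M. Diez</td></tr>
        <tr><td>Mine Cetinkaya-Rundel</td></tr>
      </table></td>
      <td>Statistics</td></tr>
</table>'

BooksNodes <- html_nodes(read_html(html_text), "table")

# One table per node; fill = TRUE pads ragged rows with NA in older rvest
# releases (newer releases always pad)
BooksTables <- html_table(BooksNodes, fill = TRUE)

length(BooksTables)  # one entry per table node
BooksTables[[2]]     # the nested author table on its own
```

The rows and cells of the nested table are also counted as part of the outer table, which is why the first result contains displaced values.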

Extract Table1

We are interested only in the main table, Table1, so we extract it and convert it to a data frame. The data frame does not yet hold the data in the expected layout.

##                               Title
## 1 The Complete Reference HTML & CSS
## 2              OpenIntro Statistics
## 3                     David M. Diez
## 4             Mine Cetinkaya-Rundel
## 5                Christopher D Barr
## 6                        Advanced R
##                                                                Authors
## 1                                                     Thomas A. Powell
## 2 David M. Diez\n\t\t\t\t\t\tMine Cetinkaya-Rundel\n\t\t\t\t\t\tChristopher D Barr
## 3                                                                 <NA>
## 4                                                                 <NA>
## 5                                                                 <NA>
## 6                                                        Hadley Wikham
##         Subject                 Genre                NA.      NA..1
## 1   HTML Coding            Technology               <NA>       <NA>
## 2 David M. Diez Mine Cetinkaya-Rundel Christopher D Barr Statistics
## 3          <NA>                  <NA>               <NA>       <NA>
## 4          <NA>                  <NA>               <NA>       <NA>
## 5          <NA>                  <NA>               <NA>       <NA>
## 6             R            Technology               <NA>       <NA>
##       NA..2
## 1      <NA>
## 2 Math/Stat
## 3      <NA>
## 4      <NA>
## 5      <NA>
## 6      <NA>

Tidy up data frame

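Because of the nested table, some values in the data frame are displaced: the three author names of the second book spill into rows 3 to 5 and push that book's Subject and Genre into the extra columns on the right. We tidy up the data by relocating these values.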

  Title                              Authors                                                     Subject      Genre
1 The Complete Reference HTML & CSS  Thomas A. Powell                                            HTML Coding  Technology
2 OpenIntro Statistics               David M. Diez , Mine Cetinkaya-Rundel , Christopher D Barr  Statistics   Math/Stat
6 Advanced R                         Hadley Wikham                                               R            Technology
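
One way the relocation could be done in base R is sketched below. The data frame mangled is typed in by hand from the output shown earlier, and the names mangled, keep, shifted and tidy are ours; this is not necessarily how the original tidy-up was coded.

```r
# Mangled frame, copied by hand from the printed output (an assumption,
# since the original object is not available here)
mangled <- data.frame(
  Title   = c("The Complete Reference HTML & CSS", "OpenIntro Statistics",
              "David M. Diez", "Mine Cetinkaya-Rundel", "Christopher D Barr",
              "Advanced R"),
  Authors = c("Thomas A. Powell",
              "David M. Diez\n\t\t\t\t\t\tMine Cetinkaya-Rundel\n\t\t\t\t\t\tChristopher D Barr",
              NA, NA, NA, "Hadley Wikham"),
  Subject = c("HTML Coding", "David M. Diez", NA, NA, NA, "R"),
  Genre   = c("Technology", "Mine Cetinkaya-Rundel", NA, NA, NA, "Technology"),
  NA.     = c(NA, "Christopher D Barr", NA, NA, NA, NA),
  NA..1   = c(NA, "Statistics", NA, NA, NA, NA),
  NA..2   = c(NA, "Math/Stat", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Rows without an Authors value are spill-over from the nested table: drop them
keep <- !is.na(mangled$Authors)
tidy <- mangled[keep, c("Title", "Authors", "Subject", "Genre")]

# Rows with a filled last overflow column had Subject and Genre pushed right
shifted <- !is.na(mangled$NA..2[keep])
tidy$Subject[shifted] <- mangled$NA..1[keep][shifted]
tidy$Genre[shifted]   <- mangled$NA..2[keep][shifted]

# Replace the newline-and-tab runs between author names with " , "
tidy$Authors <- gsub("[[:space:]]*\n[[:space:]]*", " , ", tidy$Authors)

tidy
```

Subsetting keeps the original row labels, which is why the tidy table is numbered 1, 2 and 6.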

Display class of BooksNodesTable1_df_tidy

## [1] "data.frame"

Section 2: XML

Read XML

## {xml_document}
## <books>
## [1] <book>\n  <Title>The Complete Reference HTML &amp; CSS</Title>\n  <A ...
## [2] <book>\n  <Title>OpenIntro Statistics</Title>\n  <Authors>\n    <Aut ...
## [3] <book>\n  <Title>Advanced R</Title>\n  <Author>Hadley Wikham</Author ...

Display class of BooksXML

## [1] "xml_document" "xml_node"

Populate an empty data frame with the data

Here we populate the data frame. I put the three authors of one book in a nested Authors element in the XML. Since a data frame cell cannot easily hold several values, I flattened the authors and concatenated them into a single comma-separated string.

Title                              Author                                                    Subject      Genre
The Complete Reference HTML & CSS  Thomas A. Powell                                          HTML Coding  Technology
OpenIntro Statistics               David M. Diez, Mine Cetinkaya-Rundel, Christopher D Barr  Statistics   Math/Stat
Advanced R                         Hadley Wikham                                             R            Technology
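
A minimal sketch of the XML step, under stated assumptions: the inline XML is a shortened stand-in for books.xml, the element name Author inside Authors is our guess (the printed output truncates it), and the helper first_text and the variable book_nodes are hypothetical. For brevity it builds the data frame in one call rather than filling an empty one.

```r
library(xml2)

# Shortened stand-in for books.xml; child element names are assumptions
xml_string <- '<books>
  <book>
    <Title>The Complete Reference HTML &amp; CSS</Title>
    <Author>Thomas A. Powell</Author>
    <Subject>HTML Coding</Subject>
    <Genre>Technology</Genre>
  </book>
  <book>
    <Title>OpenIntro Statistics</Title>
    <Authors>
      <Author>David M. Diez</Author>
      <Author>Mine Cetinkaya-Rundel</Author>
      <Author>Christopher D Barr</Author>
    </Authors>
    <Subject>Statistics</Subject>
    <Genre>Math/Stat</Genre>
  </book>
</books>'

BooksXML   <- read_xml(xml_string)
book_nodes <- xml_find_all(BooksXML, "/books/book")

# Text of the first direct child with the given name
first_text <- function(node, name) {
  xml_text(xml_find_first(node, paste0("./", name)))
}

# All Author elements of a book, at any depth, joined into one string
all_authors <- function(node) {
  paste(xml_text(xml_find_all(node, ".//Author")), collapse = ", ")
}

books_df <- data.frame(
  Title   = vapply(book_nodes, first_text, character(1), name = "Title"),
  Author  = vapply(book_nodes, all_authors, character(1)),
  Subject = vapply(book_nodes, first_text, character(1), name = "Subject"),
  Genre   = vapply(book_nodes, first_text, character(1), name = "Genre"),
  stringsAsFactors = FALSE
)

books_df
```

Joining with ", " gives one string per book, so single-author and multi-author books fit the same column.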

Display class of books_df

## [1] "data.frame"

Section 3: JSON

Read JSON

## $books
## $books$book
##                               Title
## 1 The Complete Reference HTML & CSS
## 2              OpenIntro Statistics
## 3                        Advanced R
##                                                     Author     Subject
## 1                                         Thomas A. Powell HTML Coding
## 2 David M. Diez, Mine Cetinkaya-Rundel, Christopher D Barr  Statistics
## 3                                            Hadley Wikham           R
##        Genre
## 1 Technology
## 2  Math/Stat
## 3 Technology

Display class of BooksJSON

## [1] "list"

Convert to data frame

##                               Title
## 1 The Complete Reference HTML & CSS
## 2              OpenIntro Statistics
## 3                        Advanced R
##                                                     Author     Subject
## 1                                         Thomas A. Powell HTML Coding
## 2 David M. Diez, Mine Cetinkaya-Rundel, Christopher D Barr  Statistics
## 3                                            Hadley Wikham           R
##        Genre
## 1 Technology
## 2  Math/Stat
## 3 Technology

Display class of BooksJSON_DF

## [1] "data.frame"
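
The JSON steps can be sketched with jsonlite as follows. The inline JSON is a shortened, hypothetical stand-in for books.json; its books/book nesting is inferred from the printed $books$book output. The original conversion code is not shown, so the extraction below is only one possible way to get the data frame.

```r
library(jsonlite)

# Shortened stand-in for books.json (structure inferred, values from the output)
json_text <- '{"books": {"book": [
  {"Title": "The Complete Reference HTML & CSS", "Author": "Thomas A. Powell",
   "Subject": "HTML Coding", "Genre": "Technology"},
  {"Title": "Advanced R", "Author": "Hadley Wikham",
   "Subject": "R", "Genre": "Technology"}
]}}'

# fromJSON() returns a nested list; an array of objects with the same keys
# is simplified to a data frame by default
BooksJSON <- fromJSON(json_text)
class(BooksJSON)

# The data frame sits two levels down in the list
BooksJSON_DF <- BooksJSON$books$book
class(BooksJSON_DF)
```

Unlike the HTML and XML cases, no manual reshaping is needed here because the parser already builds the data frame.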

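Conclusion

The three data frames hold the same three books and the same four attributes, but they are not strictly identical. HTML parsing mangled the nested table, which we fixed manually, and the tidied HTML data frame still differs from the other two in small ways: its second column is named Authors rather than Author, it keeps the original row labels 1, 2 and 6, and the authors of OpenIntro Statistics are separated by " , " instead of ", ".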
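
Whether two data frames are identical can be checked directly in R. The sketch below uses small hand-made frames (xml_df, json_df and html_df are hypothetical names, with two books only) that reproduce the differences visible in the outputs above; results on the real objects may differ, for example if column types are not the same.

```r
# Hand-made stand-ins for the XML and JSON data frames
xml_df <- data.frame(
  Title  = c("OpenIntro Statistics", "Advanced R"),
  Author = c("David M. Diez, Mine Cetinkaya-Rundel, Christopher D Barr",
             "Hadley Wikham"),
  stringsAsFactors = FALSE
)
json_df <- xml_df

# Stand-in for the tidied HTML data frame: different column name,
# leftover row labels, and " , " between authors
html_df <- data.frame(
  Title   = xml_df$Title,
  Authors = c("David M. Diez , Mine Cetinkaya-Rundel , Christopher D Barr",
              "Hadley Wikham"),
  row.names = c("2", "6"),
  stringsAsFactors = FALSE
)

identical(xml_df, json_df)   # TRUE
identical(html_df, xml_df)   # FALSE

# Harmonise the HTML frame, then compare again
names(html_df)[names(html_df) == "Authors"] <- "Author"
rownames(html_df) <- NULL
html_df$Author <- gsub(" , ", ", ", html_df$Author, fixed = TRUE)

identical(html_df, xml_df)
```

all.equal() is a more forgiving alternative to identical() and reports which components differ.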