Read and Process HTML, XML and JSON tables

Problem Statement

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

Git-Hub

The html, xml, json and .rmd file used in this assignmen can be found at: https://github.com/ShovanBiswas/DATA607/tree/master/Week07

Loading package

library(xml2)
library(rvest)
library(jsonlite)
library(prettydoc)
library(knitr)

The sections-1: HTML section

Read HTML

BooksHTML <- read_html('./books.html')
BooksHTML

## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\r\n\t\t<table id="Table1" class="regular" border="1" cellspacing=" ...

Display class of BooksHTML

class(BooksHTML)

## [1] "xml_document" "xml_node"

Extract table nodes

Since R does not readily accept HTML file as a dataframe, but stores it in dictionary of XML/HTML nodes, I like to extract individual nodes, of our interest into a R variable. This HTML has many nodes, of which the main structure is within a table node (please refer original HTML file link). And I am interested in the main table node and possibly other nested table nodes. So, I’ll extract the main table node below.

BooksNodes <- BooksHTML %>% html_nodes('table')
BooksNodes

## {xml_nodeset (2)}
## [1] <table id="Table1" class="regular" border="1" cellspacing="0" cellpadding ...
## [2] <table id="Table2">\n<tr>\n<td>David M. Diez</td>\r\n\t\t\t\t\t\t</tr>\n< ...

Extarct the table nodes to actual tables

Now, I’ll extract the tables from their respective nodes. But, that will not be readily available as a data frame, and due to the presence of nested table, the data will get mangled. But, I’ll handle them in the sequel. This step only shows the two tables extracted.

BooksNodesTable <- BooksNodes %>% html_table(fill = TRUE)
BooksNodesTable

## [[1]]
##                               Title
## 1 The Complete Reference HTML & CSS
## 2              OpenIntro Statistics
## 3                     David M. Diez
## 4             Mine Cetinkaya-Rundel
## 5                Christopher D Barr
## 6                        Advanced R
##                                                                    Authors
## 1                                                         Thomas A. Powell
## 2 David M. Diez\r\n\t\t\t\t\t\tMine Cetinkaya-Rundel\r\n\t\t\t\t\t\tChristopher D Barr
## 3                                                                     <NA>
## 4                                                                     <NA>
## 5                                                                     <NA>
## 6                                                            Hadley Wikham
##         Subject                 Genre                 NA         NA        NA
## 1   HTML Coding            Technology               <NA>       <NA>      <NA>
## 2 David M. Diez Mine Cetinkaya-Rundel Christopher D Barr Statistics Math/Stat
## 3          <NA>                  <NA>               <NA>       <NA>      <NA>
## 4          <NA>                  <NA>               <NA>       <NA>      <NA>
## 5          <NA>                  <NA>               <NA>       <NA>      <NA>
## 6             R            Technology               <NA>       <NA>      <NA>
## 
## [[2]]
##                      X1
## 1         David M. Diez
## 2 Mine Cetinkaya-Rundel
## 3    Christopher D Barr

Extarct Table1

Now, I am interested only in the main table i.e. Table1. So, I’ll extract Table1 and convert to data frame. I observe that the data frame doesn’t contain the data in expected manner.

BooksNodesTable1 <- BooksNodes %>% .[1:1] %>% html_table(fill = TRUE)
BooksNodesTable1_df <- data.frame(BooksNodesTable1)
BooksNodesTable1_df

##                               Title
## 1 The Complete Reference HTML & CSS
## 2              OpenIntro Statistics
## 3                     David M. Diez
## 4             Mine Cetinkaya-Rundel
## 5                Christopher D Barr
## 6                        Advanced R
##                                                                    Authors
## 1                                                         Thomas A. Powell
## 2 David M. Diez\r\n\t\t\t\t\t\tMine Cetinkaya-Rundel\r\n\t\t\t\t\t\tChristopher D Barr
## 3                                                                     <NA>
## 4                                                                     <NA>
## 5                                                                     <NA>
## 6                                                            Hadley Wikham
##         Subject                 Genre                NA.      NA..1     NA..2
## 1   HTML Coding            Technology               <NA>       <NA>      <NA>
## 2 David M. Diez Mine Cetinkaya-Rundel Christopher D Barr Statistics Math/Stat
## 3          <NA>                  <NA>               <NA>       <NA>      <NA>
## 4          <NA>                  <NA>               <NA>       <NA>      <NA>
## 5          <NA>                  <NA>               <NA>       <NA>      <NA>
## 6             R            Technology               <NA>       <NA>      <NA>

Tidy up data frame

I observed that due to the presence of nested tables, the data in the data frame are displaced. So, I’ll tidy up the data, by rellocating them.

BooksNodesTable1_df[2, 3] <- BooksNodesTable1_df[2, 6]
BooksNodesTable1_df[2, 4] <- BooksNodesTable1_df[2, 7]
BooksNodesTable1_df[2, 2] <- paste(BooksNodesTable1_df[3, 1], ', ', BooksNodesTable1_df[4, 1], ', ', BooksNodesTable1_df[5, 1])
BooksNodesTable1_df_tidy <- (BooksNodesTable1_df[,1:4])[-c(3:5),]
kable(BooksNodesTable1_df_tidy)

	Title	Authors	Subject	Genre
1	The Complete Reference HTML & CSS	Thomas A. Powell	HTML Coding	Technology
2	OpenIntro Statistics	David M. Diez , Mine Cetinkaya-Rundel , Christopher D Barr	Statistics	Math/Stat
6	Advanced R	Hadley Wikham	R	Technology

Display class of BooksNodesTable1_df_tidy

class(BooksNodesTable1_df_tidy)

## [1] "data.frame"

The sections-2: XML section

Read XML

BooksXML <- read_xml('./books.xml')
BooksXML

## {xml_document}
## <books>
## [1] <book>\n  <Title>The Complete Reference HTML &amp; CSS</Title>\n  <Author ...
## [2] <book>\n  <Title>OpenIntro Statistics</Title>\n  <Authors>\n    <Author>D ...
## [3] <book>\n  <Title>Advanced R</Title>\n  <Author>Hadley Wikham</Author>\n   ...

Display class of BooksXML

class(BooksXML)

## [1] "xml_document" "xml_node"

Get important parameters, for processing BooksXML

# Number of nodes (i.e. individual blocks of books) in books array
no_of_books <- xml_length(BooksXML)

# Size of each book node.
node_sz_books <- xml_length(xml_children(BooksXML))

# Max size of node_sz_books
max_node_sz_books <- max(node_sz_books)

# Index of the max node size.
pos_of_max <- which.max(node_sz_books)

Create heading for the data frame

Heading <- xml_name(xml_children(xml_child(BooksXML, pos_of_max)))

Create empty data frame and initialize

books_df <- setNames(data.frame(matrix(ncol = max_node_sz_books, nrow = no_of_books), stringsAsFactors = FALSE), Heading)

Populate empty data frame, with actual data

Populate the data frame. I put the 3 authors of a book, in an XML array. Since it’s not possible to accomodate an array in the cell of the target data frame, I flattened them out and concatenated them, as comma-separated string.

author_names <- ''
for (i in seq_len(no_of_books)) {
  for (j in Heading) {
    if(i == 2 && j == 'Author') {  ## Flattening authors in this if-block.
      for (k in 1:xml_length(xml_children(xml_children(BooksXML)[2])[2])) {
        author_names <- paste0( author_names, xml_text(xml_children(xml_children(xml_children(BooksXML)[2])[2])[k]), ', ')
      }
      books_df[i, ][j] <- gsub(', $', '', author_names)
    } else {
       books_df[i, ][j] <- xml_text(xml_find_all(xml_child(BooksXML, i), paste0(".//", j)))
    }
  }
}
kable(books_df)

Title	Author	Subject	Genre
The Complete Reference HTML & CSS	Thomas A. Powell	HTML Coding	Technology
OpenIntro Statistics	David M. Diez, Mine Cetinkaya-Rundel, Christopher D Barr	Statistics	Math/Stat
Advanced R	Hadley Wikham	R	Technology

Display class of books_df

class(books_df)

## [1] "data.frame"

The sections-3: JSON section

Read JSON

BooksJSON <- fromJSON('./books.json', flatten = TRUE)
BooksJSON

## $books
## $books$book
##                               Title
## 1 The Complete Reference HTML & CSS
## 2              OpenIntro Statistics
## 3                        Advanced R
##                                                     Author     Subject
## 1                                         Thomas A. Powell HTML Coding
## 2 David M. Diez, Mine Cetinkaya-Rundel, Christopher D Barr  Statistics
## 3                                            Hadley Wikham           R
##        Genre
## 1 Technology
## 2  Math/Stat
## 3 Technology

Display class of BooksJSON

class(BooksJSON)

## [1] "list"

Convert to data frame

BooksJSON_DF <- data.frame(BooksJSON)
names(BooksJSON_DF) <- c("Title", "Author", "Subject", "Genre")
BooksJSON_DF

##                               Title
## 1 The Complete Reference HTML & CSS
## 2              OpenIntro Statistics
## 3                        Advanced R
##                                                     Author     Subject
## 1                                         Thomas A. Powell HTML Coding
## 2 David M. Diez, Mine Cetinkaya-Rundel, Christopher D Barr  Statistics
## 3                                            Hadley Wikham           R
##        Genre
## 1 Technology
## 2  Math/Stat
## 3 Technology

Display class of BooksJSON_DF

class(BooksJSON_DF)

## [1] "data.frame"

Conslusion

The data frames are almost same. But the HTML parsing mangled up the nested table, which I manually fixed.

Marker: 607-07