Read and Process HTML, XML and JSON tables
Problem Statement
Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.
Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
Git-Hub
The html, xml, json and .rmd file used in this assignmen can be found at: https://github.com/ShovanBiswas/DATA607/tree/master/Week07
Loading package
The sections-1: HTML section
Read HTML
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\r\n\t\t<table id="Table1" class="regular" border="1" cellspacing=" ...
Extract table nodes
Since R does not readily accept HTML file as a dataframe, but stores it in dictionary of XML/HTML nodes, I like to extract individual nodes, of our interest into a R variable. This HTML has many nodes, of which the main structure is within a table node (please refer original HTML file link). And I am interested in the main table node and possibly other nested table nodes. So, I’ll extract the main table node below.
## {xml_nodeset (2)}
## [1] <table id="Table1" class="regular" border="1" cellspacing="0" cellpadding ...
## [2] <table id="Table2">\n<tr>\n<td>David M. Diez</td>\r\n\t\t\t\t\t\t</tr>\n< ...
Extarct the table nodes to actual tables
Now, I’ll extract the tables from their respective nodes. But, that will not be readily available as a data frame, and due to the presence of nested table, the data will get mangled. But, I’ll handle them in the sequel. This step only shows the two tables extracted.
## [[1]]
## Title
## 1 The Complete Reference HTML & CSS
## 2 OpenIntro Statistics
## 3 David M. Diez
## 4 Mine Cetinkaya-Rundel
## 5 Christopher D Barr
## 6 Advanced R
## Authors
## 1 Thomas A. Powell
## 2 David M. Diez\r\n\t\t\t\t\t\tMine Cetinkaya-Rundel\r\n\t\t\t\t\t\tChristopher D Barr
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 Hadley Wikham
## Subject Genre NA NA NA
## 1 HTML Coding Technology <NA> <NA> <NA>
## 2 David M. Diez Mine Cetinkaya-Rundel Christopher D Barr Statistics Math/Stat
## 3 <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA>
## 6 R Technology <NA> <NA> <NA>
##
## [[2]]
## X1
## 1 David M. Diez
## 2 Mine Cetinkaya-Rundel
## 3 Christopher D Barr
Extarct Table1
Now, I am interested only in the main table i.e. Table1. So, I’ll extract Table1 and convert to data frame. I observe that the data frame doesn’t contain the data in expected manner.
BooksNodesTable1 <- BooksNodes %>% .[1:1] %>% html_table(fill = TRUE)
BooksNodesTable1_df <- data.frame(BooksNodesTable1)
BooksNodesTable1_df## Title
## 1 The Complete Reference HTML & CSS
## 2 OpenIntro Statistics
## 3 David M. Diez
## 4 Mine Cetinkaya-Rundel
## 5 Christopher D Barr
## 6 Advanced R
## Authors
## 1 Thomas A. Powell
## 2 David M. Diez\r\n\t\t\t\t\t\tMine Cetinkaya-Rundel\r\n\t\t\t\t\t\tChristopher D Barr
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 Hadley Wikham
## Subject Genre NA. NA..1 NA..2
## 1 HTML Coding Technology <NA> <NA> <NA>
## 2 David M. Diez Mine Cetinkaya-Rundel Christopher D Barr Statistics Math/Stat
## 3 <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA>
## 6 R Technology <NA> <NA> <NA>
Tidy up data frame
I observed that due to the presence of nested tables, the data in the data frame are displaced. So, I’ll tidy up the data, by rellocating them.
BooksNodesTable1_df[2, 3] <- BooksNodesTable1_df[2, 6]
BooksNodesTable1_df[2, 4] <- BooksNodesTable1_df[2, 7]
BooksNodesTable1_df[2, 2] <- paste(BooksNodesTable1_df[3, 1], ', ', BooksNodesTable1_df[4, 1], ', ', BooksNodesTable1_df[5, 1])
BooksNodesTable1_df_tidy <- (BooksNodesTable1_df[,1:4])[-c(3:5),]
kable(BooksNodesTable1_df_tidy)| Title | Authors | Subject | Genre | |
|---|---|---|---|---|
| 1 | The Complete Reference HTML & CSS | Thomas A. Powell | HTML Coding | Technology |
| 2 | OpenIntro Statistics | David M. Diez , Mine Cetinkaya-Rundel , Christopher D Barr | Statistics | Math/Stat |
| 6 | Advanced R | Hadley Wikham | R | Technology |
The sections-2: XML section
Read XML
## {xml_document}
## <books>
## [1] <book>\n <Title>The Complete Reference HTML & CSS</Title>\n <Author ...
## [2] <book>\n <Title>OpenIntro Statistics</Title>\n <Authors>\n <Author>D ...
## [3] <book>\n <Title>Advanced R</Title>\n <Author>Hadley Wikham</Author>\n ...
Get important parameters, for processing BooksXML
# Number of nodes (i.e. individual blocks of books) in books array
no_of_books <- xml_length(BooksXML)
# Size of each book node.
node_sz_books <- xml_length(xml_children(BooksXML))
# Max size of node_sz_books
max_node_sz_books <- max(node_sz_books)
# Index of the max node size.
pos_of_max <- which.max(node_sz_books)Create heading for the data frame
Create empty data frame and initialize
Populate empty data frame, with actual data
Populate the data frame. I put the 3 authors of a book, in an XML array. Since it’s not possible to accomodate an array in the cell of the target data frame, I flattened them out and concatenated them, as comma-separated string.
author_names <- ''
for (i in seq_len(no_of_books)) {
for (j in Heading) {
if(i == 2 && j == 'Author') { ## Flattening authors in this if-block.
for (k in 1:xml_length(xml_children(xml_children(BooksXML)[2])[2])) {
author_names <- paste0( author_names, xml_text(xml_children(xml_children(xml_children(BooksXML)[2])[2])[k]), ', ')
}
books_df[i, ][j] <- gsub(', $', '', author_names)
} else {
books_df[i, ][j] <- xml_text(xml_find_all(xml_child(BooksXML, i), paste0(".//", j)))
}
}
}
kable(books_df)| Title | Author | Subject | Genre |
|---|---|---|---|
| The Complete Reference HTML & CSS | Thomas A. Powell | HTML Coding | Technology |
| OpenIntro Statistics | David M. Diez, Mine Cetinkaya-Rundel, Christopher D Barr | Statistics | Math/Stat |
| Advanced R | Hadley Wikham | R | Technology |
The sections-3: JSON section
Read JSON
## $books
## $books$book
## Title
## 1 The Complete Reference HTML & CSS
## 2 OpenIntro Statistics
## 3 Advanced R
## Author Subject
## 1 Thomas A. Powell HTML Coding
## 2 David M. Diez, Mine Cetinkaya-Rundel, Christopher D Barr Statistics
## 3 Hadley Wikham R
## Genre
## 1 Technology
## 2 Math/Stat
## 3 Technology
Convert to data frame
BooksJSON_DF <- data.frame(BooksJSON)
names(BooksJSON_DF) <- c("Title", "Author", "Subject", "Genre")
BooksJSON_DF## Title
## 1 The Complete Reference HTML & CSS
## 2 OpenIntro Statistics
## 3 Advanced R
## Author Subject
## 1 Thomas A. Powell HTML Coding
## 2 David M. Diez, Mine Cetinkaya-Rundel, Christopher D Barr Statistics
## 3 Hadley Wikham R
## Genre
## 1 Technology
## 2 Math/Stat
## 3 Technology
Conslusion
The data frames are almost same. But the HTML parsing mangled up the nested table, which I manually fixed.
Marker: 607-07