We are using prettydoc with tactile theme, for our rmarkdown this week. prettydoc has great documentation in this link https://prettydoc.statr.me/index.html
Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.
Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
The html, xml, json and .rmd file used in this assignmen can be found at: https://github.com/ShovanBiswas/DATA607/tree/master/Week7 and https://github.com/forhadakbar/data607fall2019/tree/master/Week%2007
BooksHTML <- read_html('https://raw.githubusercontent.com/forhadakbar/data607fall2019/master/Week%2007/books.html')
BooksHTML## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n\t\t<table id="Table1" class="regular" border="1" cellspacin ...
Since R does not readily accept HTML file as a dataframe, but stores it in dictionary of XML/HTML nodes, we we like to extract individual nodes, of our interest into a R variable. This HTML has many nodes, of which the main structure is within a table node (please refer original HTML file link). And we are interested in the main table node and possibly other nested table nodes. So, we’ll extract the main table node below.
## {xml_nodeset (2)}
## [1] <table id="Table1" class="regular" border="1" cellspacing="0" cellpa ...
## [2] <table id="Table2">\n<tr>\n<td>David M. Diez</td>\n\t\t\t\t\t\t</tr> ...
Now, we’ll extract the tables from their respective nodes. But, that will not be readily available as a data frame, and due to the presence of nested table, the data will get mangled. But, we’ll handle them in the sequel. This step only shows the two tables extracted.
## [[1]]
## Title
## 1 The Complete Reference HTML & CSS
## 2 OpenIntro Statistics
## 3 David M. Diez
## 4 Mine Cetinkaya-Rundel
## 5 Christopher D Barr
## 6 Advanced R
## Authors
## 1 Thomas A. Powell
## 2 David M. Diez\n\t\t\t\t\t\tMine Cetinkaya-Rundel\n\t\t\t\t\t\tChristopher D Barr
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 Hadley Wikham
## Subject Genre NA NA
## 1 HTML Coding Technology <NA> <NA>
## 2 David M. Diez Mine Cetinkaya-Rundel Christopher D Barr Statistics
## 3 <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA>
## 6 R Technology <NA> <NA>
## NA
## 1 <NA>
## 2 Math/Stat
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
##
## [[2]]
## X1
## 1 David M. Diez
## 2 Mine Cetinkaya-Rundel
## 3 Christopher D Barr
Now, we are interested only in the main table i.e. Table1. So, we’ll extract Table1 and convert to data frame. We observe that the data frame doesn’t contain the data in expected manner.
BooksNodesTable1 <- BooksNodes %>% .[1:1] %>% html_table(fill = TRUE)
BooksNodesTable1_df <- data.frame(BooksNodesTable1)
BooksNodesTable1_df## Title
## 1 The Complete Reference HTML & CSS
## 2 OpenIntro Statistics
## 3 David M. Diez
## 4 Mine Cetinkaya-Rundel
## 5 Christopher D Barr
## 6 Advanced R
## Authors
## 1 Thomas A. Powell
## 2 David M. Diez\n\t\t\t\t\t\tMine Cetinkaya-Rundel\n\t\t\t\t\t\tChristopher D Barr
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 Hadley Wikham
## Subject Genre NA. NA..1
## 1 HTML Coding Technology <NA> <NA>
## 2 David M. Diez Mine Cetinkaya-Rundel Christopher D Barr Statistics
## 3 <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA>
## 6 R Technology <NA> <NA>
## NA..2
## 1 <NA>
## 2 Math/Stat
## 3 <NA>
## 4 <NA>
## 5 <NA>
## 6 <NA>
We observed that due to the presence of nested tables, the data in the data frame are displaced. So, we’ll tidy up the data, by rellocating them.
BooksNodesTable1_df[2, 3] <- BooksNodesTable1_df[2, 6]
BooksNodesTable1_df[2, 4] <- BooksNodesTable1_df[2, 7]
BooksNodesTable1_df[2, 2] <- paste(BooksNodesTable1_df[3, 1], ', ', BooksNodesTable1_df[4, 1], ', ', BooksNodesTable1_df[5, 1])
BooksNodesTable1_df_tidy <- (BooksNodesTable1_df[,1:4])[-c(3:5),]
kable(BooksNodesTable1_df_tidy)| Title | Authors | Subject | Genre | |
|---|---|---|---|---|
| 1 | The Complete Reference HTML & CSS | Thomas A. Powell | HTML Coding | Technology |
| 2 | OpenIntro Statistics | David M. Diez , Mine Cetinkaya-Rundel , Christopher D Barr | Statistics | Math/Stat |
| 6 | Advanced R | Hadley Wikham | R | Technology |
BooksXML <- read_xml('https://raw.githubusercontent.com/forhadakbar/data607fall2019/master/Week%2007/books.xml')
BooksXML## {xml_document}
## <books>
## [1] <book>\n <Title>The Complete Reference HTML & CSS</Title>\n <A ...
## [2] <book>\n <Title>OpenIntro Statistics</Title>\n <Authors>\n <Aut ...
## [3] <book>\n <Title>Advanced R</Title>\n <Author>Hadley Wikham</Author ...
# Number of nodes (i.e. individual blocks of books) in books array
no_of_books <- xml_length(BooksXML)
# Size of each book node.
node_sz_books <- xml_length(xml_children(BooksXML))
# Max size of node_sz_books
max_node_sz_books <- max(node_sz_books)
# Index of the max node size.
pos_of_max <- which.max(node_sz_books)Populate the data frame. I put the 3 authors of a book, in an XML array. Since it’s not possible to accomodate an array in the cell of the target data frame, I flattened them out and concatenated them, as comma-separated string.
author_names <- ''
for (i in seq_len(no_of_books)) {
for (j in Heading) {
if(i == 2 && j == 'Author') { ## Flattening authors in this if-block.
for (k in 1:xml_length(xml_children(xml_children(BooksXML)[2])[2])) {
author_names <- paste0( author_names, xml_text(xml_children(xml_children(xml_children(BooksXML)[2])[2])[k]), ', ')
}
books_df[i, ][j] <- gsub(', $', '', author_names)
} else {
books_df[i, ][j] <- xml_text(xml_find_all(xml_child(BooksXML, i), paste0(".//", j)))
}
}
}
kable(books_df)| Title | Author | Subject | Genre |
|---|---|---|---|
| The Complete Reference HTML & CSS | Thomas A. Powell | HTML Coding | Technology |
| OpenIntro Statistics | David M. Diez, Mine Cetinkaya-Rundel, Christopher D Barr | Statistics | Math/Stat |
| Advanced R | Hadley Wikham | R | Technology |
BooksJSON <- fromJSON('https://raw.githubusercontent.com/forhadakbar/data607fall2019/master/Week%2007/books.json', flatten = TRUE)
BooksJSON## $books
## $books$book
## Title
## 1 The Complete Reference HTML & CSS
## 2 OpenIntro Statistics
## 3 Advanced R
## Author Subject
## 1 Thomas A. Powell HTML Coding
## 2 David M. Diez, Mine Cetinkaya-Rundel, Christopher D Barr Statistics
## 3 Hadley Wikham R
## Genre
## 1 Technology
## 2 Math/Stat
## 3 Technology
BooksJSON_DF <- data.frame(BooksJSON)
names(BooksJSON_DF) <- c("Title", "Author", "Subject", "Genre")
BooksJSON_DF## Title
## 1 The Complete Reference HTML & CSS
## 2 OpenIntro Statistics
## 3 Advanced R
## Author Subject
## 1 Thomas A. Powell HTML Coding
## 2 David M. Diez, Mine Cetinkaya-Rundel, Christopher D Barr Statistics
## 3 Hadley Wikham R
## Genre
## 1 Technology
## 2 Math/Stat
## 3 Technology
The data frames are almost same. But the HTML parsing mangled up the nested table, which we manually fixed.