Assignment - Working with HTML, XML and JSON in R

library(kableExtra)

All three files are stored in a GitHub repository.

HTML

library(rvest)
library(dplyr)
htmlFile<-"https://raw.githubusercontent.com/pkowalchuk/Data607_Week7_Assignment/master/books.html"
#html_table() returns a list of tables; as_tibble() replaces the deprecated tbl_df()
htmlBooks<-as_tibble(as.data.frame(read_html(htmlFile) %>% html_table(header = NA, trim=TRUE, fill=TRUE)))
htmlBooks %>% kable() %>% kable_styling() %>% scroll_box(width = "910px")
| Tittle | Authors | Genres | Pages | AvailableInAudio |
|---|---|---|---|---|
| Algorithms to Live By: The Computer Science of Human Decisions | Brian Christian, Tom Griffiths | Science, Computer Science, Psychology, Non-fiction, Technology | 368 | Yes |
| String Theory | David Foster Wallace | Writing, Non-fiction, Essays, Science, Sports, Sports and Games | 138 | No |
| War and Peace | Leo Tolstoy | Novel | 1225 | Yes |

XML

library(XML)
library(RCurl)
xmlfile<-getURL("https://raw.githubusercontent.com/pkowalchuk/Data607_Week7_Assignment/master/books.xml")
#as_tibble() replaces the deprecated tbl_df()
xmlBooks<-as_tibble(xmlToDataFrame(xmlfile))
xmlBooks %>% kable() %>% kable_styling() %>% scroll_box(width = "910px")
| Title | Authors | Genres | Pages | AvailableInAudio |
|---|---|---|---|---|
| Algorithms to Live By: The Computer Science of Human Decisions | Brian Christian,Tom Griffiths | Science,Computer Science,Psychology,Non-fiction,Technology, | 368 | Yes |
| String Theory | David Foster Wallace | Writing,Non-fiction,Essays,Science,Sports,Sports and Games | 138 | No |
| War and Peace | Leo Tolstoy | Novel | 1225 | Yes |

JSON

library(RJSONIO)
library(jsonlite)
jsonFile<-"https://raw.githubusercontent.com/pkowalchuk/Data607_Week7_Assignment/master/books.json"
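#isValidJSON() comes from RJSONIO; fromJSON() exists in both packages, so we name jsonlite explicitly
RJSONIO::isValidJSON(jsonFile)
## [1] TRUE
jsonbooks<-as_tibble(as.data.frame(jsonlite::fromJSON(jsonFile)))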
jsonbooks %>% kable() %>% kable_styling() %>% scroll_box(width = "910px")
| books.Title | books.Authors | books.Genres | books.Pages | books.AvailableInAudio |
|---|---|---|---|---|
| Algorithms to Live By: The Computer Science of Human Decisions | c("Brian Christian", "Tom Griffiths") | c("Science", "Computer Science", "Psychology", "Non-fiction", "Technology") | 368 | Yes |
| String Theory | David Foster Wallace | c("Writing", "Non-fiction", "Essays", "Science", "Sports", "Sports and Games") | 138 | No |
| War and Peace | Leo Tolstoy | Novel | 1225 | Yes |

Are the three data frames identical?

Although all three data frames contain the same information, they are not structured in exactly the same way. Each one holds the child data for Authors and Genres a little differently, but we can manipulate the data frames so that the results match.
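One quick way to see how the structures differ is to list the class of each column. The sketch below uses small hand-built stand-ins (`html_like`, `xml_like`, `json_like` and the helper `column_classes` are hypothetical names, not the data frames loaded above) that mimic the three cases described in this section: plain strings, factors, and list columns.

```r
# Hand-built stand-ins for the three imports (hypothetical data, for illustration only)
html_like <- data.frame(Authors = c("A One, B Two", "C Three"),
                        stringsAsFactors = FALSE)
xml_like  <- data.frame(Authors = c("A One,B Two", "C Three"),
                        stringsAsFactors = TRUE)
json_like <- data.frame(Title = c("x", "y"), stringsAsFactors = FALSE)
json_like$Authors <- list(c("A One", "B Two"), "C Three")

# Report the first class of every column in a data frame
column_classes <- function(df) {
  vapply(df, function(col) class(col)[1], character(1))
}

column_classes(html_like)  # Authors is "character"
column_classes(xml_like)   # Authors is "factor"
column_classes(json_like)  # Authors is "list"
```

Running the same helper on `htmlBooks`, `xmlBooks` and `jsonbooks` should show where each import needs work before the three can be compared.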

In the HTML file, the authors and genres of each book are stored in a single table cell as items separated by commas. They arrive in the data frame the same way: one string per cell, with the items separated by commas. If necessary, we can split each string into a vector in which every element holds one author or one genre. This makes the data frame easier to work with when operations on individual authors are required.

library(stringr)
#str_split() is vectorised, so every row can be split in one call
htmlBooks$Authors<-str_split(htmlBooks$Authors,", ")
htmlBooks$Genres<-str_split(htmlBooks$Genres,", ")
htmlBooks %>% kable() %>% kable_styling() %>% scroll_box(width = "910px")
| Tittle | Authors | Genres | Pages | AvailableInAudio |
|---|---|---|---|---|
| Algorithms to Live By: The Computer Science of Human Decisions | c("Brian Christian", "Tom Griffiths") | c("Science", "Computer Science", "Psychology", "Non-fiction", "Technology") | 368 | Yes |
| String Theory | David Foster Wallace | c("Writing", "Non-fiction", "Essays", "Science", "Sports", "Sports and Games") | 138 | No |
| War and Peace | Leo Tolstoy | Novel | 1225 | Yes |
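The split can also be sketched without stringr. Base R's `strsplit()` is vectorised over its input and returns a list with one character vector per row, which can be stored directly as a list column. The `books` data frame here is a hypothetical two-row stand-in, not `htmlBooks`.

```r
# Hypothetical stand-in with comma-separated authors, as in the HTML import
books <- data.frame(
  Title   = c("Algorithms to Live By", "War and Peace"),
  Authors = c("Brian Christian, Tom Griffiths", "Leo Tolstoy"),
  stringsAsFactors = FALSE
)

# strsplit() returns one character vector per row; fixed = TRUE treats ", " literally
books$Authors <- strsplit(books$Authors, ", ", fixed = TRUE)

books$Authors[[1]]      # both authors of the first book, as separate strings
lengths(books$Authors)  # number of authors per book
```

A book with a single author simply becomes a vector of length one, so every row ends up with the same kind of value.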

In the data frame built from the XML file, every column is stored as a factor. Although we could work with the data in this format, we can unfactor it to match what we have from HTML.

library(varhandle)
xmlBooks$Title<-unfactor(xmlBooks$Title)
xmlBooks$Authors<-unfactor(xmlBooks$Authors)
xmlBooks$Genres<-unfactor(xmlBooks$Genres)
xmlBooks$Pages<-as.integer(unfactor(xmlBooks$Pages))
xmlBooks$AvailableInAudio<-unfactor(xmlBooks$AvailableInAudio)
#we split the authors so that each row holds a vector of authors
xmlBooks$Authors<-str_split(xmlBooks$Authors,",")
#we do the same for genres; the trailing comma in the first row leaves an empty last element
xmlBooks$Genres<-str_split(xmlBooks$Genres,",")
xmlBooks %>% kable() %>% kable_styling() %>% scroll_box(width = "910px")
| Title | Authors | Genres | Pages | AvailableInAudio |
|---|---|---|---|---|
| Algorithms to Live By: The Computer Science of Human Decisions | c("Brian Christian", "Tom Griffiths") | c("Science", "Computer Science", "Psychology", "Non-fiction", "Technology", "") | 368 | Yes |
| String Theory | David Foster Wallace | c("Writing", "Non-fiction", "Essays", "Science", "Sports", "Sports and Games") | 138 | No |
| War and Peace | Leo Tolstoy | Novel | 1225 | Yes |
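For reference, the factor conversion can be sketched in base R without varhandle. `as.character()` recovers the labels of a factor, and a numeric factor has to go through `as.character()` before `as.integer()`, otherwise the internal level codes are returned instead of the values. The same sketch shows one way to drop the empty string that a trailing comma leaves behind after a split. The `pages` and `genres` vectors are made-up examples.

```r
# A factor holding page counts, as happens when numbers are read in as factors
pages <- factor(c("368", "138", "1225"))

as.integer(pages)                # internal level codes, not the page counts
as.integer(as.character(pages))  # the actual page counts

# Dropping the empty string left behind by a trailing comma
genres <- c("Science", "Technology", "")
genres[nzchar(genres)]           # keeps only the non-empty elements
```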

In JSON, the Authors and Genres columns of the data frame already hold vectors of the individual items, so no further manipulation of those columns is required. For holding data, this format seems the most efficient: it has less overhead than XML and is geared towards storing and transmitting data rather than presenting it, as HTML is. We can still rename the columns to match the HTML data frame (whose first column is spelled "Tittle") and convert the Pages column to integers.

colnames(jsonbooks)<-c('Tittle','Authors','Genres','Pages','AvailableInAudio')
jsonbooks$Pages<-as.integer(jsonbooks$Pages)
jsonbooks %>% kable() %>% kable_styling() %>% scroll_box(width = "910px")
| Tittle | Authors | Genres | Pages | AvailableInAudio |
|---|---|---|---|---|
| Algorithms to Live By: The Computer Science of Human Decisions | c("Brian Christian", "Tom Griffiths") | c("Science", "Computer Science", "Psychology", "Non-fiction", "Technology") | 368 | Yes |
| String Theory | David Foster Wallace | c("Writing", "Non-fiction", "Essays", "Science", "Sports", "Sports and Games") | 138 | No |
| War and Peace | Leo Tolstoy | Novel | 1225 | Yes |
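Hard-coding the five names works here, but the `books.` prefix could also be stripped with a pattern, which leaves the rest of each name untouched. A minimal sketch in base R, using a hypothetical stand-in (`json_like`) for the prefixed data frame:

```r
# Hypothetical stand-in with the "books." prefix seen in the JSON import
json_like <- data.frame(
  books.Title = c("String Theory", "War and Peace"),
  books.Pages = c(138, 1225),
  stringsAsFactors = FALSE
)

# Remove a leading "books." from every column name
names(json_like) <- sub("^books\\.", "", names(json_like))

# Store page counts as integers rather than doubles
json_like$Pages <- as.integer(json_like$Pages)

names(json_like)        # "Title" "Pages"
class(json_like$Pages)  # "integer"
```

Note that this approach yields `Title`, which matches the XML data frame rather than the `Tittle` spelling of the HTML header.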

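With these operations, all three data frames have a similar structure. The only remaining difference visible in the printouts is the name of the first column, which is Title in xmlBooks and Tittle in the other two.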

htmlBooks
## # A tibble: 3 x 5
##   Tittle                           Authors  Genres  Pages AvailableInAudio
##   <chr>                            <list>   <list>  <int> <chr>           
## 1 Algorithms to Live By: The Comp… <chr [2… <chr […   368 Yes             
## 2 String Theory                    <chr [1… <chr […   138 No              
## 3 War and Peace                    <chr [1… <chr […  1225 Yes
xmlBooks
## # A tibble: 3 x 5
##   Title                            Authors  Genres  Pages AvailableInAudio
##   <chr>                            <list>   <list>  <int> <chr>           
## 1 Algorithms to Live By: The Comp… <chr [2… <chr […   368 Yes             
## 2 String Theory                    <chr [1… <chr […   138 No              
## 3 War and Peace                    <chr [1… <chr […  1225 Yes
jsonbooks
## # A tibble: 3 x 5
##   Tittle                           Authors  Genres  Pages AvailableInAudio
## * <chr>                            <list>   <list>  <int> <chr>           
## 1 Algorithms to Live By: The Comp… <chr [2… <chr […   368 Yes             
## 2 String Theory                    <chr [1… <chr […   138 No              
## 3 War and Peace                    <chr [1… <chr […  1225 Yes
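To answer the question in code rather than by eye, the data frames can be compared with `identical()`. The sketch below uses two small hand-built data frames (`a` and `b` are hypothetical, standing in for two of the imports) that differ only in the spelling of the first column name, as the HTML and XML printouts do.

```r
# Two stand-in data frames with the same contents but different first column names
a <- data.frame(Tittle = c("String Theory", "War and Peace"),
                Pages  = c(138L, 1225L), stringsAsFactors = FALSE)
a$Authors <- list("David Foster Wallace", "Leo Tolstoy")

b <- data.frame(Title = c("String Theory", "War and Peace"),
                Pages = c(138L, 1225L), stringsAsFactors = FALSE)
b$Authors <- list("David Foster Wallace", "Leo Tolstoy")

identical(a, b)   # FALSE: the column names differ

# Align the names, then compare again
names(a) <- names(b)
identical(a, b)   # TRUE: same names, types, and values
```

`identical()` is strict about types and attributes, so it would also flag a factor column against a character column or an integer column against a double one. `all.equal()` is a more forgiving alternative that reports what differs.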