CUNY MSDS DATA 607 - HTML, JSON, XML

Libraries

library(RCurl)
library(tidyverse)
library(XML)
library(RJSONIO)
library(knitr)

Task:

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

HTML

html_hw <- getURLContent("https://rawgit.com/nschettini/CUNY-MSDS-DATA-607/master/htmlhw.htm")

html_df <- html_hw %>%
  readHTMLTable() %>%
  data.frame()

colnames(html_df) <- str_replace(colnames(html_df),"NULL\\.", "")
colnames(html_df) <- str_replace(colnames(html_df),"\\.", " ")

kable(html_df)

Book Title	Author	Cover Type	Subject	Pages
A Brief History of Time	Stephen Hawking	Kindle	Popular Science	256
Ready Player One	Ernest Cline	Soft	Science Fiction	385
Automated Data Collection with R	Simon Munzert	Hard	Data Science	452
Automated Data Collection with R	Christian Rubba	Hard	Data Science	452
Automated Data Collection with R	Peter Meibner	Hard	Data Science	452
Automated Data Collection with R	Dominic Nyhuis	Hard	Data Science	452

XML

xml_hw <- getURLContent("https://rawgit.com/nschettini/CUNY-MSDS-DATA-607/master/xmlhw.xml")

xml_df <- xml_hw %>%
  xmlParse() %>%
  xmlToDataFrame()

kable(xml_df)

book_title	authors	cover_type	subject	pages
A Brief History of Time	Stephen Hawking	Soft	Popular Science	256
Ready Player One	Ernest Cline	Soft	Science Fiction	385
Automated Data Collection with R	Simon MunzertChristian RubbaPeter MeibnerDominic Nyhuis	Hard	Data Science	452

JSON

json_hw <- getURLContent("https://rawgit.com/nschettini/CUNY-MSDS-DATA-607/master/jsonhw.json")

json_df <- fromJSON(json_hw)
json_df <- do.call("rbind", lapply(json_df$`favorite recent books`, data.frame, stringsAsFactors = FALSE))

kable(json_df)

Book.Title	Authors	Cover.Type	Subject	Pages
A Brief History of Time	Stephen Hawking	Kindle	Popular Science	256
Ready Player One	Ernest Cline	Soft	Science Fiction	385
Automated Data Collection with R	Simon Munzert	Hard	Data Science	452
Automated Data Collection with R	Christian Rubba	Hard	Data Science	452
Automated Data Collection with R	Peter Meibner	Hard	Data Science	452
Automated Data Collection with R	Dominic Nyhuis	Hard	Data Science	452
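The six rows arise because data.frame() recycles a book's length-one fields against its longer author vector. Below is a minimal sketch of that behavior, using a hand-built list in place of the parsed JSON. The `records` list and its field names are assumptions made for illustration, not the contents of jsonhw.json.

```r
# Hypothetical stand-in for the parsed JSON: one list per book.
records <- list(
  list(Title = "Ready Player One", Authors = "Ernest Cline", Pages = 385),
  list(Title = "Automated Data Collection with R",
       Authors = c("Simon Munzert", "Christian Rubba"),
       Pages = 452)
)

# data.frame() recycles the length-one Title and Pages values to match
# the length of Authors, so a two-author book becomes two rows.
per_book <- lapply(records, data.frame, stringsAsFactors = FALSE)

# Stack the per-book data frames into a single data frame.
books_df <- do.call(rbind, per_book)
print(books_df)
```

With these two illustrative records the result has three rows: one for the single-author book and one per author for the other.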

Conclusions:

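The three data frames are not identical. The HTML and JSON data frames match value for value, although their column names differ (for example, Author versus Authors), which an element-wise comparison with == does not check: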

html_df == json_df

##      Book Title Author Cover Type Subject Pages
## [1,]       TRUE   TRUE       TRUE    TRUE  TRUE
## [2,]       TRUE   TRUE       TRUE    TRUE  TRUE
## [3,]       TRUE   TRUE       TRUE    TRUE  TRUE
## [4,]       TRUE   TRUE       TRUE    TRUE  TRUE
## [5,]       TRUE   TRUE       TRUE    TRUE  TRUE
## [6,]       TRUE   TRUE       TRUE    TRUE  TRUE
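An element-wise == on two data frames requires only equal dimensions and compares values, so it can report all TRUE even when column names differ. The sketch below contrasts it with identical(), using two made-up data frames (`a` and `b` are illustrative, not the assignment data).

```r
# Two small data frames with the same values but different column names.
a <- data.frame(Title = c("X", "Y"), Pages = c("256", "385"),
                stringsAsFactors = FALSE)
b <- data.frame(book_title = c("X", "Y"), pages = c("256", "385"),
                stringsAsFactors = FALSE)

# Cell-by-cell comparison: the values agree.
values_match <- all(a == b)

# Strict comparison: fails because the column names differ.
strict_before <- identical(a, b)

# After aligning the names, the strict comparison succeeds as well.
names(b) <- names(a)
strict_after <- identical(a, b)

print(c(values_match, strict_before, strict_after))
```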

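However, the XML data frame differs from both. It has three rows instead of six, because each book's authors are nested under a single authors element and xmlToDataFrame() concatenates them into one string (Simon MunzertChristian RubbaPeter MeibnerDominic Nyhuis). The displayed XML output also lists the cover type of A Brief History of Time as Soft, where the other two sources list Kindle.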
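One way to get one row per author from XML, matching the shape of the HTML and JSON results, is to walk the book nodes and extract each author separately. This is a sketch under assumptions: the inline `xml_text` is a hypothetical, shortened stand-in for xmlhw.xml, and the `books`, `book`, and `author` element names are guesses at its structure (only `book_title`, `authors`, and `pages` appear in the output above).

```r
library(XML)

# Hypothetical XML with the authors nested under one <authors> element.
xml_text <- '<books>
  <book>
    <book_title>Ready Player One</book_title>
    <authors><author>Ernest Cline</author></authors>
    <pages>385</pages>
  </book>
  <book>
    <book_title>Automated Data Collection with R</book_title>
    <authors>
      <author>Simon Munzert</author>
      <author>Christian Rubba</author>
    </authors>
    <pages>452</pages>
  </book>
</books>'

doc <- xmlParse(xml_text, asText = TRUE)

# Build one data frame per book; the author vector drives the row count
# and the length-one title and pages values are recycled against it.
book_rows <- lapply(getNodeSet(doc, "//book"), function(book) {
  data.frame(
    book_title = xmlValue(book[["book_title"]]),
    author = xpathSApply(book, "./authors/author", xmlValue),
    pages = xmlValue(book[["pages"]]),
    stringsAsFactors = FALSE
  )
})

xml_long_df <- do.call(rbind, book_rows)
print(xml_long_df)
```

Extracting the authors with an XPath query keeps each name as its own value instead of relying on xmlToDataFrame() to flatten the nested element.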