CUNY MSDS DATA 607 - HTML, JSON, XML

Nicholas Schettini

March 17, 2018

Libraries

library(RCurl)
library(tidyverse)
library(XML)
library(RJSONIO)
library(knitr)

Task:

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

HTML

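# Download the raw HTML, parse the table it contains, and convert it to a data frame
html_hw <- getURLContent("https://rawgit.com/nschettini/CUNY-MSDS-DATA-607/master/htmlhw.htm")
html_df <- html_hw %>%
  readHTMLTable() %>%
  data.frame()

# Strip the "NULL." prefix that data.frame() adds to each column name
colnames(html_df) <- str_replace(colnames(html_df), "^NULL\\.", "")
# Turn the remaining dot in two-word headers back into a space
colnames(html_df) <- str_replace(colnames(html_df), "\\.", " ")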

kable(html_df)
| Book Title | Author | Cover Type | Subject | Pages |
|---|---|---|---|---|
| A Brief History of Time | Stephen Hawking | Kindle | Popular Science | 256 |
| Ready Player One | Ernest Cline | Soft | Science Fiction | 385 |
| Automated Data Collection with R | Simon Munzert | Hard | Data Science | 452 |
| Automated Data Collection with R | Christian Rubba | Hard | Data Science | 452 |
| Automated Data Collection with R | Peter Meibner | Hard | Data Science | 452 |
| Automated Data Collection with R | Dominic Nyhuis | Hard | Data Science | 452 |

XML

# Download the XML, parse it, and turn each top-level record into one row
xml_hw <- getURLContent("https://rawgit.com/nschettini/CUNY-MSDS-DATA-607/master/xmlhw.xml")
xml_df <- xml_hw %>%
  xmlParse() %>%
  xmlToDataFrame()

kable(xml_df)
| book_title | authors | cover_type | subject | pages |
|---|---|---|---|---|
| A Brief History of Time | Stephen Hawking | Soft | Popular Science | 256 |
| Ready Player One | Ernest Cline | Soft | Science Fiction | 385 |
| Automated Data Collection with R | Simon MunzertChristian RubbaPeter MeibnerDominic Nyhuis | Hard | Data Science | 452 |
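
The authors cell in the last row runs four names together because xmlToDataFrame() takes the full text of each top-level field. A minimal sketch of one way to get one row per author instead follows. It assumes each book nests its names in <author> child elements under <authors>; that layout, the <books>/<book> wrappers, and the shortened inline sample are illustrative assumptions, not the actual contents of xmlhw.xml.

```r
library(XML)

# Hypothetical stand-in for xmlhw.xml (tag layout is assumed, fields are reduced)
xml_text <- '<books>
  <book>
    <book_title>Ready Player One</book_title>
    <authors><author>Ernest Cline</author></authors>
    <pages>385</pages>
  </book>
  <book>
    <book_title>Automated Data Collection with R</book_title>
    <authors>
      <author>Simon Munzert</author>
      <author>Christian Rubba</author>
    </authors>
    <pages>452</pages>
  </book>
</books>'

doc   <- xmlParse(xml_text, asText = TRUE)
books <- getNodeSet(doc, "//book")

# Build one small data frame per book; the single title and page count
# are recycled by data.frame() to match the number of authors.
rows <- lapply(books, function(book) {
  data.frame(
    book_title = xpathSApply(book, "./book_title", xmlValue),
    authors    = xpathSApply(book, "./authors/author", xmlValue),
    pages      = as.integer(xpathSApply(book, "./pages", xmlValue)),
    stringsAsFactors = FALSE
  )
})

xml_long <- do.call(rbind, rows)
xml_long
```

With this shape the XML result has the same row-per-author layout as the other two formats, which makes a later comparison more meaningful.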

JSON

json_hw <- getURLContent("https://rawgit.com/nschettini/CUNY-MSDS-DATA-607/master/jsonhw.json")
json_df <- fromJSON(json_hw)
# Convert each book record to a data frame and stack the results row-wise
json_df <- do.call("rbind", lapply(json_df$`favorite recent books`, data.frame, stringsAsFactors = FALSE))

kable(json_df)
| Book.Title | Authors | Cover.Type | Subject | Pages |
|---|---|---|---|---|
| A Brief History of Time | Stephen Hawking | Kindle | Popular Science | 256 |
| Ready Player One | Ernest Cline | Soft | Science Fiction | 385 |
| Automated Data Collection with R | Simon Munzert | Hard | Data Science | 452 |
| Automated Data Collection with R | Christian Rubba | Hard | Data Science | 452 |
| Automated Data Collection with R | Peter Meibner | Hard | Data Science | 452 |
| Automated Data Collection with R | Dominic Nyhuis | Hard | Data Science | 452 |
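
The JSON result has one row per author because data.frame() recycles the length-one fields of a record to match a longer Authors vector. The sketch below illustrates this on an inline two-book sample. Only the `favorite recent books` key comes from the code above; the reduced set of fields and the assumption that Authors is a JSON array are placeholders, not the actual contents of jsonhw.json.

```r
library(RJSONIO)

# Hypothetical stand-in for jsonhw.json (fields reduced for brevity)
json_text <- '{
  "favorite recent books": [
    {"Book Title": "Ready Player One", "Authors": "Ernest Cline", "Pages": 385},
    {"Book Title": "Automated Data Collection with R",
     "Authors": ["Simon Munzert", "Christian Rubba"], "Pages": 452}
  ]
}'

parsed <- fromJSON(json_text, asText = TRUE)

# Each record is a list. Authors has length 2 for the second book,
# so data.frame() repeats that book's title and page count per author.
records <- lapply(parsed[["favorite recent books"]], data.frame,
                  stringsAsFactors = FALSE)
json_long <- do.call(rbind, records)
json_long
```

Note that data.frame() also rewrites "Book Title" as Book.Title, which is why the JSON column names contain dots.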

Conclusions:

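The three data frames are not all identical. The HTML and JSON data frames hold the same values, as an element-wise comparison shows (== compares cell values only, so their slightly different column names do not affect the result):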

html_df == json_df
##      Book Title Author Cover Type Subject Pages
## [1,]       TRUE   TRUE       TRUE    TRUE  TRUE
## [2,]       TRUE   TRUE       TRUE    TRUE  TRUE
## [3,]       TRUE   TRUE       TRUE    TRUE  TRUE
## [4,]       TRUE   TRUE       TRUE    TRUE  TRUE
## [5,]       TRUE   TRUE       TRUE    TRUE  TRUE
## [6,]       TRUE   TRUE       TRUE    TRUE  TRUE

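However, the XML data frame differs from both. All four authors of Automated Data Collection with R are stored in a single authors value, so that book occupies one row instead of one row per author. The XML output also lists the cover type of A Brief History of Time as Soft, where the HTML and JSON outputs list Kindle.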
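
Because == only compares cell values, a stricter check can help when deciding whether two data frames are really the same object. The sketch below uses two small hypothetical stand-in frames (not the homework data) to show how identical() and all.equal() also take column names and column types into account.

```r
# Hypothetical stand-ins: same values, different column names and types
a <- data.frame(Title = c("A", "B"), Pages = c("256", "385"),
                stringsAsFactors = FALSE)
b <- data.frame(Book.Title = c("A", "B"), Pages = c(256, 385),
                stringsAsFactors = FALSE)

values_match <- all(a == b)        # element-wise comparison of values only
strict_match <- identical(a, b)    # also requires equal names and types
differences  <- all.equal(a, b)    # character vector describing mismatches

# Align names and types, then compare again
b_aligned <- b
names(b_aligned) <- names(a)
b_aligned$Pages  <- as.character(b_aligned$Pages)
aligned_match <- isTRUE(all.equal(a, b_aligned))

values_match   # TRUE
strict_match   # FALSE
aligned_match  # TRUE
```

The same pattern could be applied to html_df and json_df after renaming their columns to a common set.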