Libraries
library(RCurl)
library(tidyverse)
library(XML)
library(RJSONIO)
library(knitr)
Task:
Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.
Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
HTML
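# Download the raw HTML, parse its table, and coerce the result to a data frame
html_hw <- getURLContent("https://rawgit.com/nschettini/CUNY-MSDS-DATA-607/master/htmlhw.htm")
html_df <- html_hw %>%
  readHTMLTable() %>%
  data.frame()
# Remove the "NULL." prefix from the column names, then turn the first dot in each name back into a space
colnames(html_df) <- str_replace(colnames(html_df), "NULL\\.", "")
colnames(html_df) <- str_replace(colnames(html_df), "\\.", " ")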
kable(html_df)
Book Title | Author | Cover Type | Subject | Pages |
---|---|---|---|---|
A Brief History of Time | Stephen Hawking | Kindle | Popular Science | 256 |
Ready Player One | Ernest Cline | Soft | Science Fiction | 385 |
Automated Data Collection with R | Simon Munzert | Hard | Data Science | 452 |
Automated Data Collection with R | Christian Rubba | Hard | Data Science | 452 |
Automated Data Collection with R | Peter Meibner | Hard | Data Science | 452 |
Automated Data Collection with R | Dominic Nyhuis | Hard | Data Science | 452 |
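The two renaming steps above can be tried in isolation. The sketch below uses base R's `sub()` as a stand-in for `str_replace()` (both replace only the first match), and the input names are assumed examples of what `data.frame()` produces here, not output captured from the actual file.

```r
# Assumed column names as they might look right after data.frame()
raw_names <- c("NULL.Book.Title", "NULL.Author", "NULL.Cover.Type")

# Step 1: drop the "NULL." prefix
no_prefix <- sub("NULL\\.", "", raw_names)

# Step 2: replace the first remaining dot with a space
clean_names <- sub("\\.", " ", no_prefix)

print(clean_names)
```

Because only the first dot is replaced, a hypothetical three-word column such as `Year.First.Published` would keep its second dot; `gsub()` (or `str_replace_all()`) would be needed in that case.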
XML
# Download the raw XML, parse it, and convert its records to a data frame
xml_hw <- getURLContent("https://rawgit.com/nschettini/CUNY-MSDS-DATA-607/master/xmlhw.xml")
xml_df <- xml_hw %>%
  xmlParse() %>%
  xmlToDataFrame()
kable(xml_df)
book_title | authors | cover_type | subject | pages |
---|---|---|---|---|
A Brief History of Time | Stephen Hawking | Soft | Popular Science | 256 |
Ready Player One | Ernest Cline | Soft | Science Fiction | 385 |
Automated Data Collection with R | Simon MunzertChristian RubbaPeter MeibnerDominic Nyhuis | Hard | Data Science | 452 |
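In the table above, the four authors are run together into a single string. One way to get one row per author is to walk the book nodes and pull each author out separately. This is only a sketch: the inline XML and its element names (`books`, `book`, `authors`, `author`) are assumptions about the file's layout, not a copy of the real file.

```r
library(XML)

# Hypothetical XML with the assumed nesting: <authors> holds one <author> per person
xml_text <- '<books>
  <book>
    <book_title>Ready Player One</book_title>
    <authors><author>Ernest Cline</author></authors>
  </book>
  <book>
    <book_title>Automated Data Collection with R</book_title>
    <authors>
      <author>Simon Munzert</author>
      <author>Christian Rubba</author>
    </authors>
  </book>
</books>'

doc <- xmlParse(xml_text, asText = TRUE)
books <- getNodeSet(doc, "//book")

# Build one small data frame per book; the single title is recycled across its authors
rows <- lapply(books, function(book) {
  data.frame(
    book_title = xpathSApply(book, "./book_title", xmlValue),
    author = xpathSApply(book, "./authors/author", xmlValue),
    stringsAsFactors = FALSE
  )
})

xml_long <- do.call(rbind, rows)
print(xml_long)
```

With this layout the result has one row per book and author pair, which is the same shape as the HTML and JSON data frames.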
JSON
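# Download the raw JSON and parse it into a nested list
json_hw <- getURLContent("https://rawgit.com/nschettini/CUNY-MSDS-DATA-607/master/jsonhw.json")
json_df <- fromJSON(json_hw)
# Convert each book entry to a data frame, then stack the pieces row-wise
json_df <- do.call("rbind", lapply(json_df$`favorite recent books`, data.frame, stringsAsFactors = FALSE))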
kable(json_df)
Book.Title | Authors | Cover.Type | Subject | Pages |
---|---|---|---|---|
A Brief History of Time | Stephen Hawking | Kindle | Popular Science | 256 |
Ready Player One | Ernest Cline | Soft | Science Fiction | 385 |
Automated Data Collection with R | Simon Munzert | Hard | Data Science | 452 |
Automated Data Collection with R | Christian Rubba | Hard | Data Science | 452 |
Automated Data Collection with R | Peter Meibner | Hard | Data Science | 452 |
Automated Data Collection with R | Dominic Nyhuis | Hard | Data Science | 452 |
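A plausible reason the multi-author book appears as four rows is that `data.frame()` recycles length-one fields to match the longest field. The sketch below assumes the parsed JSON stores the authors as a vector inside each entry; the `entries` list is a hypothetical stand-in for the parsed file, not its actual contents.

```r
# Hypothetical parsed entries: scalar fields plus a vector of authors
entries <- list(
  list(Book.Title = "Ready Player One",
       Authors = "Ernest Cline",
       Pages = 385),
  list(Book.Title = "Automated Data Collection with R",
       Authors = c("Simon Munzert", "Christian Rubba",
                   "Peter Meibner", "Dominic Nyhuis"),
       Pages = 452)
)

# Same pattern as above: one data frame per entry, stacked with rbind
books_df <- do.call("rbind", lapply(entries, data.frame, stringsAsFactors = FALSE))

print(books_df)
```

The title and page count of the second entry are repeated once per author, giving five rows in total for these two toy entries.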
Conclusions:
The three data frames are not identical. The HTML and JSON data frames match value for value, although their column names differ slightly (for example, Author versus Authors):
html_df == json_df
## Book Title Author Cover Type Subject Pages
## [1,] TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE TRUE TRUE
## [6,] TRUE TRUE TRUE TRUE TRUE
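However, the XML data frame differs from both. It has three rows instead of six, because the four authors of Automated Data Collection with R are concatenated into a single authors value instead of being split into one row per author. Its column names also differ (book_title, cover_type), and it lists the cover type of A Brief History of Time as Soft rather than Kindle.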
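A note on the `html_df == json_df` check: `==` on two data frames compares cell values only and requires matching dimensions, so it does not notice differing column names. A stricter test is `identical()`. The sketch below uses two toy data frames (`a` and `b`, illustrative values only) rather than the real ones.

```r
# Toy data frames with equal values but one differing column name
a <- data.frame(Title = c("X", "Y"), Author = c("P", "Q"), stringsAsFactors = FALSE)
b <- data.frame(Title = c("X", "Y"), Authors = c("P", "Q"), stringsAsFactors = FALSE)

# Element-wise comparison looks only at the cell values
values_equal <- all(a == b)

# identical() also compares column names and column types
strict_before <- identical(a, b)

# After aligning the names, the strict check passes as well
names(b) <- names(a)
strict_after <- identical(a, b)

print(c(values_equal = values_equal,
        strict_before = strict_before,
        strict_after = strict_after))
```

Applied to the real data, the same idea would mean aligning column names (and possibly column types, such as factor versus character) before calling `identical()`.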