The Task

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the books’ information in HTML (using an HTML table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

Load Packages

knitr::opts_chunk$set(warning=FALSE, 
                      message=FALSE,
                      tidy=FALSE,
                      #comment = "",
                      dev="png", 
                      dev.args=list(type="cairo"))
#https://cran.r-project.org/web/packages/prettydoc/vignettes/
#https://www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf

#create vector with all needed packages
load_packages <- c("RCurl","prettydoc", "stringr", "dplyr", "knitr", "janitor", "XML", "tidyr", "RJSONIO")

#see if we need to install any of them
new.pkg <- load_packages[!(load_packages %in% installed.packages()[, "Package"])]
if (length(new.pkg)) install.packages(new.pkg, dependencies = TRUE)

#attach every package; the double transpose prints the TRUE/FALSE results as a single column
t(t(sapply(load_packages, require, character.only = TRUE, quietly = TRUE,  warn.conflicts = FALSE)))

## Warning: package 'janitor' was built under R version 3.3.3

##           [,1]
## RCurl     TRUE
## prettydoc TRUE
## stringr   TRUE
## dplyr     TRUE
## knitr     TRUE
## janitor   TRUE
## XML       TRUE
## tidyr     TRUE
## RJSONIO   TRUE

#CODE SOURCE DOCUMENTATION: https://gist.github.com/stevenworthington/3178163

1. HTML Table Parsing

Load data, look at the HTML table & then its data frame

url_html <- getURLContent("https://raw.githubusercontent.com/kylegilde/D607-Data-Acquistion/master/data-sets/books/books.html")

#my_html_df <- htmlParse(url_html)

writeLines(url_html)

## <table>
## <tr> <th>Book Title</th> <th>Author</th> <th>Cover Type</th> <th>Subject</th>    <th>Pages</th>  </tr>
## <tr> <td>Automated Data Collection with R</td>   <td>Simon Munzert</td>  <td>Hard</td>   <td>Computer Science</td>   <td>452</td>    </tr>
## <tr> <td>Automated Data Collection with R</td>   <td>Christian Rubba</td>    <td>Hard</td>   <td>Computer Science</td>   <td>452</td>    </tr>
## <tr> <td>Automated Data Collection with R</td>   <td>Peter Meibner</td>  <td>Hard</td>   <td>Computer Science</td>   <td>452</td>    </tr>
## <tr> <td>Automated Data Collection with R</td>   <td>Dominic Nyhuis</td> <td>Hard</td>   <td>Computer Science</td>   <td>452</td>    </tr>
## <tr> <td>Guns, Germs and Steel</td>  <td>Jared Diamond</td>  <td>Soft</td>   <td>History</td>    <td>498</td>    </tr>
## <tr> <td>The Signal and the Noise</td>   <td>Nate Silver</td>    <td>Soft</td>   <td>Science and Math</td>   <td>560</td>    </tr>
## </table>

my_html_df <- url_html %>% 
  readHTMLTable(header=TRUE, as.data.frame = TRUE) %>% 
  data.frame(stringsAsFactors = FALSE) %>% 
  clean_names() 

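#readHTMLTable() returns a list of tables, so data.frame() prefixes every column name with "NULL."; clean_names() turns that into "null_", which is stripped here
colnames(my_html_df) <- str_replace(colnames(my_html_df), "^null_", "")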

kable(my_html_df)

book_title	author	cover_type	subject	pages
Automated Data Collection with R	Simon Munzert	Hard	Computer Science	452
Automated Data Collection with R	Christian Rubba	Hard	Computer Science	452
Automated Data Collection with R	Peter Meibner	Hard	Computer Science	452
Automated Data Collection with R	Dominic Nyhuis	Hard	Computer Science	452
Guns, Germs and Steel	Jared Diamond	Soft	History	498
The Signal and the Noise	Nate Silver	Soft	Science and Math	560
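
The `null_` clean-up above can probably be avoided by asking `readHTMLTable()` for a single table rather than a list of tables. The following is a minimal sketch, assuming the `XML` package is installed. The inline `html_text` is an abbreviated stand-in for books.html, and `html_doc` and `html_df_direct` are illustrative names that do not appear in the original code.

```r
library(XML)

# Abbreviated stand-in for books.html (two of the six rows shown above)
html_text <- '<table>
<tr><th>Book Title</th><th>Author</th><th>Pages</th></tr>
<tr><td>Guns, Germs and Steel</td><td>Jared Diamond</td><td>498</td></tr>
<tr><td>The Signal and the Noise</td><td>Nate Silver</td><td>560</td></tr>
</table>'

# Parse the string as HTML content rather than as a file name
html_doc <- htmlParse(html_text, asText = TRUE)

# which = 1 returns the first table as a data frame instead of a list of tables
html_df_direct <- readHTMLTable(html_doc, header = TRUE, which = 1,
                                stringsAsFactors = FALSE)

str(html_df_direct)
```

Because the result is already a single data frame, no `NULL.` prefix is added to the column names, so only `clean_names()` would still be needed.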

2. JSON Parsing

Load data, look at the JSON & then its data frame

url_json <- getURLContent("https://raw.githubusercontent.com/kylegilde/D607-Data-Acquistion/master/data-sets/books/books.json")

print("Is my JSON valid?")

## [1] "Is my JSON valid?"

isValidJSON("https://raw.githubusercontent.com/kylegilde/D607-Data-Acquistion/master/data-sets/books/books.json")

## [1] TRUE

writeLines(url_json)

## {"favorite recent books":[
##  {
##  "Book Title": "Guns, Germs and Steel",
##  "Authors": "Jared Diamond",
##  "Cover Type": "Soft",
##  "Subject": "History",
##  "Pages": 498
##  },
##  {
##  "Book Title": "The Signal and the Noise",
##  "Authors": "Nate Silver",
##  "Cover Type": "Soft",
##  "Subject": "Science and Math",
##  "Pages": 560
##  },
##      {
##  "Book Title": "Automated Data Collection with R",
##  "Authors": ["Simon Munzert", "Christian Rubba", "Peter Meibner", "Dominic Nyhuis"],
##  "Cover Type": "Hard",
##  "Subject": "Computer Science",
##  "Pages": 452
##  }]
## }

my_json_df <- fromJSON(url_json)

#data.frame() recycles the single-valued fields, so the four-author book becomes four rows
my_json_df <- do.call("rbind", lapply(my_json_df$`favorite recent books`, data.frame, stringsAsFactors = FALSE))

my_json_df <- my_json_df %>%  
  clean_names() %>% 
  arrange(book_title)

kable(my_json_df, caption = "This data frame looks the same as the HTML one.")

This data frame looks the same as the HTML one.
book_title	authors	cover_type	subject	pages
Automated Data Collection with R	Simon Munzert	Hard	Computer Science	452
Automated Data Collection with R	Christian Rubba	Hard	Computer Science	452
Automated Data Collection with R	Peter Meibner	Hard	Computer Science	452
Automated Data Collection with R	Dominic Nyhuis	Hard	Computer Science	452
Guns, Germs and Steel	Jared Diamond	Soft	History	498
The Signal and the Noise	Nate Silver	Soft	Science and Math	560
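
The JSON result has one row per author because `data.frame()` recycles length-one fields to match the longest field in each record. The sketch below isolates that step, assuming the `RJSONIO` package is installed. The inline `json_text` is an abbreviated stand-in for books.json, and `records`, `per_book` and `json_long_df` are illustrative names.

```r
library(RJSONIO)

# Abbreviated stand-in for books.json (two books, one with two authors)
json_text <- '{"favorite recent books":[
  {"Book Title": "Guns, Germs and Steel",
   "Authors": "Jared Diamond",
   "Pages": 498},
  {"Book Title": "Automated Data Collection with R",
   "Authors": ["Simon Munzert", "Christian Rubba"],
   "Pages": 452}
]}'

parsed  <- fromJSON(json_text, asText = TRUE)
records <- parsed[["favorite recent books"]]

# Each record becomes its own small data frame; the title and page count
# are repeated once per author in that record
per_book <- lapply(records, data.frame, stringsAsFactors = FALSE)

# Stack the per-book data frames into one long data frame
json_long_df <- do.call(rbind, per_book)

print(json_long_df)
```

This only works because every multi-valued field in a record has the same length; a record with, say, two authors and three subjects would not recycle cleanly.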

3. XML Parsing

Load data, look at the XML & then its data frame

url_XML <- getURLContent("https://raw.githubusercontent.com/kylegilde/D607-Data-Acquistion/master/data-sets/books/books2.xml")

writeLines(url_XML)

## <?xml version="1.0" encoding="UTF-8" ?>
## <!--These are some of my recent favorite books-->
## <books>
##  <book>
##      <book_title>Automated Data Collection with R</book_title>
##      <authors>
##          <author ID="1">Simon Munzert</author>
##          <author ID="2">Christian Rubba</author>
##          <author ID="3">Peter Meibner</author>
##          <author ID="4">Dominic Nyhuis</author>
##      </authors>
##      <cover_type>Hard</cover_type>
##      <subject>Computer Science</subject>
##      <pages>452</pages>
##  </book>
##  <book>      
##      <book_title>Guns, Germs and Steel</book_title>
##      <authors>
##          <author ID="1">Jared Diamond</author>
##      </authors>
##      <cover_type>Soft</cover_type>
##      <subject>History</subject>
##      <pages>498</pages>
##  </book>     
##  <book>      
##      <book_title>The Signal and the Noise</book_title>
##      <authors>
##          <author ID="1">Nate Silver</author>
##      </authors>
##      <cover_type>Soft</cover_type>
##      <subject>Science and Math</subject>
##      <pages>560</pages>
##  </book>     
## </books>

my_XML_df <- url_XML %>% 
  xmlParse() %>% 
  xmlToDataFrame(stringsAsFactors = FALSE) 

kable(my_XML_df, caption = "This does not look the same as the first 2 data frames. For the book with more than one author, the function concatenated all of them into a single cell. Let's do a little bit of surgery to get the same result.")

This does not look the same as the first 2 data frames. For the book with more than one author, the function concatenated all of them into a single cell. Let’s do a little bit of surgery to get the same result.
book_title	authors	cover_type	subject	pages
Automated Data Collection with R	Simon MunzertChristian RubbaPeter MeibnerDominic Nyhuis	Hard	Computer Science	452
Guns, Germs and Steel	Jared Diamond	Soft	History	498
The Signal and the Noise	Nate Silver	Soft	Science and Math	560

my_XML_df2 <- my_XML_df %>% 
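  #insert a comma wherever a lowercase letter runs straight into an uppercase one; this assumes no name has an internal capital (e.g. "McDonald")
  mutate(authors = str_replace_all(authors, "([a-z])([A-Z])", "\\1,\\2")) %>% 
  separate(authors, paste0("author_", 1:4), sep = ",", fill = "right") %>% 
  gather(author_num, author, author_1:author_4, na.rm = TRUE) %>% 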
  select(book_title, author, everything(), -author_num) %>% 
  arrange(book_title) 
  
kable(my_XML_df2, caption = "And now it looks like the other 2 data frames")

And now it looks like the other 2 data frames
book_title	author	cover_type	subject	pages
Automated Data Collection with R	Simon Munzert	Hard	Computer Science	452
Automated Data Collection with R	Christian Rubba	Hard	Computer Science	452
Automated Data Collection with R	Peter Meibner	Hard	Computer Science	452
Automated Data Collection with R	Dominic Nyhuis	Hard	Computer Science	452
Guns, Germs and Steel	Jared Diamond	Soft	History	498
The Signal and the Noise	Nate Silver	Soft	Science and Math	560
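
The regular-expression surgery above depends on how the names happen to be capitalized. An alternative that reads each `<author>` node separately is sketched below with XPath, assuming the `XML` package is installed. The inline `xml_text` is an abbreviated stand-in for books2.xml, and `book_to_rows` and `xml_long_df` are hypothetical names, not part of the original solution.

```r
library(XML)

# Abbreviated stand-in for books2.xml (two books, one with two authors)
xml_text <- '<books>
  <book>
    <book_title>Automated Data Collection with R</book_title>
    <authors>
      <author ID="1">Simon Munzert</author>
      <author ID="2">Christian Rubba</author>
    </authors>
    <cover_type>Hard</cover_type>
    <subject>Computer Science</subject>
    <pages>452</pages>
  </book>
  <book>
    <book_title>Guns, Germs and Steel</book_title>
    <authors>
      <author ID="1">Jared Diamond</author>
    </authors>
    <cover_type>Soft</cover_type>
    <subject>History</subject>
    <pages>498</pages>
  </book>
</books>'

doc   <- xmlParse(xml_text, asText = TRUE)
books <- getNodeSet(doc, "//book")

# Build one row per author; single-valued fields are recycled by data.frame()
book_to_rows <- function(book) {
  data.frame(
    book_title = xpathSApply(book, "./book_title", xmlValue),
    author     = xpathSApply(book, "./authors/author", xmlValue),
    cover_type = xpathSApply(book, "./cover_type", xmlValue),
    subject    = xpathSApply(book, "./subject", xmlValue),
    pages      = as.numeric(xpathSApply(book, "./pages", xmlValue)),
    stringsAsFactors = FALSE
  )
}

xml_long_df <- do.call(rbind, lapply(books, book_to_rows))

print(xml_long_df)
```

Since every author is extracted from its own node, names are never concatenated and no fixed maximum number of authors has to be assumed.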

After reworking the XML data frame, all three data frames contain the same values

my_json_df == my_html_df

##      book_title authors cover_type subject pages
## [1,]       TRUE    TRUE       TRUE    TRUE  TRUE
## [2,]       TRUE    TRUE       TRUE    TRUE  TRUE
## [3,]       TRUE    TRUE       TRUE    TRUE  TRUE
## [4,]       TRUE    TRUE       TRUE    TRUE  TRUE
## [5,]       TRUE    TRUE       TRUE    TRUE  TRUE
## [6,]       TRUE    TRUE       TRUE    TRUE  TRUE

my_html_df == my_XML_df2

##      book_title author cover_type subject pages
## [1,]       TRUE   TRUE       TRUE    TRUE  TRUE
## [2,]       TRUE   TRUE       TRUE    TRUE  TRUE
## [3,]       TRUE   TRUE       TRUE    TRUE  TRUE
## [4,]       TRUE   TRUE       TRUE    TRUE  TRUE
## [5,]       TRUE   TRUE       TRUE    TRUE  TRUE
## [6,]       TRUE   TRUE       TRUE    TRUE  TRUE
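
The `==` comparisons above check values cell by cell, coercing types as needed and taking the column names from the left-hand data frame, so they cannot show whether the data frames are identical in the strict sense. The sketch below illustrates the difference using small hypothetical data frames (`html_like` and `json_like`). It assumes that the HTML route yields text for `pages` while the JSON route yields a number, which may not match the actual column types in this report.

```r
# Hypothetical one-row stand-ins: same values, different column name and type
html_like <- data.frame(book_title = "Guns, Germs and Steel",
                        author     = "Jared Diamond",
                        pages      = "498",
                        stringsAsFactors = FALSE)

json_like <- data.frame(book_title = "Guns, Germs and Steel",
                        authors    = "Jared Diamond",
                        pages      = 498,
                        stringsAsFactors = FALSE)

# Cell-by-cell comparison coerces 498 to "498" and ignores column names
values_match <- all(html_like == json_like)

# identical() also compares names and column types
strictly_identical <- identical(html_like, json_like)

# Column classes reveal where the two differ
print(sapply(html_like, class))
print(sapply(json_like, class))

# Harmonize names and types, then compare again
json_fixed <- json_like
names(json_fixed) <- names(html_like)
html_fixed <- html_like
html_fixed$pages <- as.numeric(html_fixed$pages)

equal_after_fix <- isTRUE(all.equal(html_fixed, json_fixed))

print(c(values_match = values_match,
        strictly_identical = strictly_identical,
        equal_after_fix = equal_after_fix))
```

Running `identical()` or `all.equal()` on the real data frames, after aligning column names and types in the same way, would answer the assignment's question more directly than `==`.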

D607 Wk07 HW - Parsing HTML, XML & JSON

Kyle Gilde

Mar. 14, 2017