Data 607_Hw7_Working with XML and JSON in R

Week 7 Assignment Description

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

Data - Three Books

Title	Author	Publisher	Year	Edition	ISBN
Automated Data Collection with R	Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis	John Wiley & Sons, Ltd	2015	1st	978-1-118-83481-7
Data Science for Business	Foster Provost, Tom Fawcett	O’Reilly Media, Inc	2013	1st	978-1-449-36132-7
Text Mining with R: A Tidy Approach	Julia Silge, David Robinson	O’Reilly Media, Inc	2017	1st	978-1-491-98165-8

library(tidyverse)
library(XML)
library(rvest)
library(RCurl)
library(jsonlite)
library(kableExtra)

Convert HTML to R Dataframe

The source file is shown below:

url <- getURL('https://raw.githubusercontent.com/shirley-wong/Data-607/master/Three_Books.htm')
HTML_data <- htmlParse(url)
HTML_data

## <!DOCTYPE html>
## <html>
## <head><title>Three Books</title></head>
## <body>
##  <table>
## <tr>
## <th>Title</th>
##          <th>Authors</th>
##          <th>Publisher</th>
##          <th>Year</th>
##          <th>Edition</th>
##          <th>ISBN</th> 
##          </tr>
## <tr>
## <td>Automated Data Collection with R</td>
##          <td>Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis</td>
##          <td>John Wiley &amp; Sons, Ltd</td>
##          <td>2015</td>
##          <td>1st</td>
##          <td>978-1-118-83481-7</td>
##          </tr>
## <tr>
## <td>Data Science for Business</td>
##          <td>Foster Provost, Tom Fawcett</td>
##          <td>O’Reilly Media, Inc</td>
##          <td>2013</td>
##          <td>1st</td>
##          <td>978-1-449-36132-7</td>
##      </tr>
## <tr>
## <td>Text Mining with R: A Tidy Approach</td>
##          <td>Julia Silge, David Robinson</td>
##          <td>O’Reilly Media, Inc</td>
##          <td>2017</td>
##          <td>1st</td>
##          <td>978-1-491-98165-8</td>
##      </tr>
## </table>
## </body>
## </html>
##

Load the HTML data into R as a data frame using the rvest package:

HTML_df <- url %>%                         # `url` holds the downloaded HTML text, not the address
  read_html(encoding = 'UTF-8') %>%        # parse the HTML text into a document
  html_table(header = NA, trim = TRUE) %>% # extract every <table> as a list of data frames
  .[[1]]                                   # keep the first (and only) table

kable(HTML_df)

Title	Authors	Publisher	Year	Edition	ISBN
Automated Data Collection with R	Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis	John Wiley & Sons, Ltd	2015	1st	978-1-118-83481-7
Data Science for Business	Foster Provost, Tom Fawcett	O’Reilly Media, Inc	2013	1st	978-1-449-36132-7
Text Mining with R: A Tidy Approach	Julia Silge, David Robinson	O’Reilly Media, Inc	2017	1st	978-1-491-98165-8

str(HTML_df)

## 'data.frame':    3 obs. of  6 variables:
##  $ Title    : chr  "Automated Data Collection with R" "Data Science for Business" "Text Mining with R: A Tidy Approach"
##  $ Authors  : chr  "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis" "Foster Provost, Tom Fawcett" "Julia Silge, David Robinson"
##  $ Publisher: chr  "John Wiley & Sons, Ltd" "O’Reilly Media, Inc" "O’Reilly Media, Inc"
##  $ Year     : int  2015 2013 2017
##  $ Edition  : chr  "1st" "1st" "1st"
##  $ ISBN     : chr  "978-1-118-83481-7" "978-1-449-36132-7" "978-1-491-98165-8"
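The table extraction step can be tried without any network access. The following is a minimal sketch, assuming an abbreviated inline table (the `html_text` string is a stand-in, not the real Three_Books.htm) and using nested calls instead of the pipe. It shows that `html_table()` returns a list with one entry per `<table>`, and that wrapping the entry in `as.data.frame()` gives a plain data frame regardless of what class the installed rvest version returns.

```r
library(rvest)

# Hypothetical, abbreviated stand-in for Three_Books.htm (two columns, two books)
html_text <- '<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Data Science for Business</td><td>2013</td></tr>
  <tr><td>Text Mining with R: A Tidy Approach</td><td>2017</td></tr>
</table>'

# read_html() accepts literal HTML text; html_table() returns one entry per <table>
tables <- html_table(read_html(html_text))

# Coerce to a plain data.frame so later comparisons do not depend on the returned class
books <- as.data.frame(tables[[1]])

length(tables)
str(books)
```

Coercing to a plain data frame is a design choice for this assignment: it keeps the later `all.equal()` comparisons focused on the values rather than on object classes.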

Convert XML to R Dataframe

The source file is shown below:

url <- getURL('https://raw.githubusercontent.com/shirley-wong/Data-607/master/Three_Books.xml')
XML_data <- xmlParse(url)
XML_data

## <?xml version="1.0" encoding="UTF-8"?>
## <three_books>
##   <book id="1">
##     <Title>Automated Data Collection with R</Title>
##     <Authors>
##       <Author ID="1">Simon Munzert</Author>
##       <Author ID="2">Christian Rubba</Author>
##       <Author ID="3">Peter Meißner</Author>
##       <Author ID="4">Dominic Nyhuis</Author>
##     </Authors>
##     <Publisher>John Wiley &amp; Sons, Ltd</Publisher>
##     <Year>2015</Year>
##     <Edition>1st</Edition>
##     <ISBN>978-1-118-83481-7</ISBN>
##   </book>
##   <book id="2">
##     <Title>Data Science for Business</Title>
##     <Authors>
##       <Author ID="1">Foster Provost</Author>
##       <Author ID="2">Tom Fawcett</Author>
##     </Authors>
##     <Publisher>O’Reilly Media, Inc</Publisher>
##     <Year>2013</Year>
##     <Edition>1st</Edition>
##     <ISBN>978-1-449-36132-7</ISBN>
##   </book>
##   <book id="3">
##     <Title>Text Mining with R: A Tidy Approach</Title>
##     <Authors>
##       <Author ID="1">Julia Silge</Author>
##       <Author ID="2">David Robinson</Author>
##     </Authors>
##     <Publisher>O’Reilly Media, Inc</Publisher>
##     <Year>2017</Year>
##     <Edition>1st</Edition>
##     <ISBN>978-1-491-98165-8</ISBN>
##   </book>
## </three_books>
##

Load the XML data into R as a data frame using the XML package:

XML_df <- url %>%
  xmlParse() %>%                                # parse the downloaded XML text into a document
  xmlRoot() %>%                                 # get the root node, <three_books>
  xmlToDataFrame(stringsAsFactors = FALSE) %>%  # one row per <book>; nested <Author> text is concatenated
  mutate(Year = as.integer(Year))               # Year arrives as character; convert to match the HTML data frame
kable(XML_df)

Title	Authors	Publisher	Year	Edition	ISBN
Automated Data Collection with R	Simon MunzertChristian RubbaPeter MeißnerDominic Nyhuis	John Wiley & Sons, Ltd	2015	1st	978-1-118-83481-7
Data Science for Business	Foster ProvostTom Fawcett	O’Reilly Media, Inc	2013	1st	978-1-449-36132-7
Text Mining with R: A Tidy Approach	Julia SilgeDavid Robinson	O’Reilly Media, Inc	2017	1st	978-1-491-98165-8

str(XML_df)

## 'data.frame':    3 obs. of  6 variables:
##  $ Title    : chr  "Automated Data Collection with R" "Data Science for Business" "Text Mining with R: A Tidy Approach"
##  $ Authors  : chr  "Simon MunzertChristian RubbaPeter MeißnerDominic Nyhuis" "Foster ProvostTom Fawcett" "Julia SilgeDavid Robinson"
##  $ Publisher: chr  "John Wiley & Sons, Ltd" "O’Reilly Media, Inc" "O’Reilly Media, Inc"
##  $ Year     : int  2015 2013 2017
##  $ Edition  : chr  "1st" "1st" "1st"
##  $ ISBN     : chr  "978-1-118-83481-7" "978-1-449-36132-7" "978-1-491-98165-8"
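The Authors column above runs the names together because `xmlToDataFrame()` takes the combined text of each `<Authors>` node. One possible way to keep the names separated is to query the `<Author>` nodes with XPath and join them explicitly. The sketch below assumes an abbreviated inline copy of the XML (two of the three books, fewer fields) so that it runs without downloading anything; the variable names are illustrative only.

```r
library(XML)

# Abbreviated inline XML with the same nesting as Three_Books.xml
xml_text <- '<three_books>
  <book id="2">
    <Title>Data Science for Business</Title>
    <Authors>
      <Author ID="1">Foster Provost</Author>
      <Author ID="2">Tom Fawcett</Author>
    </Authors>
  </book>
  <book id="3">
    <Title>Text Mining with R: A Tidy Approach</Title>
    <Authors>
      <Author ID="1">Julia Silge</Author>
      <Author ID="2">David Robinson</Author>
    </Authors>
  </book>
</three_books>'

doc   <- xmlParse(xml_text, asText = TRUE)
books <- getNodeSet(doc, "//book")

# For each <book>, collect its <Author> values and join them with ", "
authors <- sapply(books, function(b) {
  paste(xpathSApply(b, "./Authors/Author", xmlValue), collapse = ", ")
})

authors
```

Applied to the full document, a vector built this way could replace the flattened column, for example with `XML_df$Authors <- authors`.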

Convert JSON to R Dataframe

The source file is shown below:

‘JSON Source File’

Load the JSON data into R as a data frame using the jsonlite package:

url <- getURL("https://raw.githubusercontent.com/shirley-wong/Data-607/master/Three_Books.json")
JSON_df <- url %>%
  fromJSON() %>%    # parse the JSON text into a list
  .[[1]] %>%        # the first element is the data frame of books
  # Authors is a list column; join each book's authors into one comma-separated string
  mutate(Authors = unlist(lapply(Authors, function(x) str_c(x, collapse = ', '))))

kable(JSON_df)

Title	Authors	Publisher	Year	Edition	ISBN
Automated Data Collection with R	Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis	John Wiley & Sons, Ltd	2015	1st	978-1-118-83481-7
Data Science for Business	Foster Provost, Tom Fawcett	O’Reilly Media, Inc	2013	1st	978-1-449-36132-7
Text Mining with R: A Tidy Approach	Julia Silge, David Robinson	O’Reilly Media, Inc	2017	1st	978-1-491-98165-8

str(JSON_df)

## 'data.frame':    3 obs. of  6 variables:
##  $ Title    : chr  "Automated Data Collection with R" "Data Science for Business" "Text Mining with R: A Tidy Approach"
##  $ Authors  : chr  "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis" "Foster Provost, Tom Fawcett" "Julia Silge, David Robinson"
##  $ Publisher: chr  "John Wiley & Sons, Ltd" "O’Reilly Media, Inc" "O’Reilly Media, Inc"
##  $ Year     : int  2015 2013 2017
##  $ Edition  : chr  "1st" "1st" "1st"
##  $ ISBN     : chr  "978-1-118-83481-7" "978-1-449-36132-7" "978-1-491-98165-8"
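The `.[[1]]` and `lapply()` steps above imply that the books sit in an array under a single top-level key and that each book's authors are stored as a JSON array. The sketch below illustrates that idea with a hypothetical layout: the key name `three_books` and the reduced set of fields are assumptions, since the real Three_Books.json is not reproduced here. It uses base R's `vapply()` with `paste()` in place of `str_c()`.

```r
library(jsonlite)

# Hypothetical JSON layout (assumed; the key name and field subset are illustrative)
json_text <- '{
  "three_books": [
    {"Title": "Automated Data Collection with R",
     "Authors": ["Simon Munzert", "Christian Rubba", "Peter Meißner", "Dominic Nyhuis"],
     "Year": 2015},
    {"Title": "Data Science for Business",
     "Authors": ["Foster Provost", "Tom Fawcett"],
     "Year": 2013}
  ]
}'

parsed <- fromJSON(json_text)
books  <- parsed[[1]]     # the data frame stored under the top-level key

# Authors is a list column: one character vector per book
class(books$Authors)

# Collapse each vector into a single comma-separated string
books$Authors <- vapply(books$Authors, paste, character(1), collapse = ", ")
books$Authors
```

Note that the ", " delimiter comes from the collapse step, not from the JSON parser, which matters when comparing this data frame with the others.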

Comparison

1. Between HTML and XML

The data frames built from the HTML and XML files are not exactly the same. The <table> element in the HTML file is parsed completely and accurately into an R data frame, but xmlToDataFrame() concatenates the text of the nested <Author> elements inside <Authors> without any delimiter.

all.equal(HTML_df,XML_df)

## [1] "Component \"Authors\": 3 string mismatches"

rbind(HTML_df$Authors, XML_df$Authors)

##      [,1]                                                           
## [1,] "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis"
## [2,] "Simon MunzertChristian RubbaPeter MeißnerDominic Nyhuis"      
##      [,2]                          [,3]                         
## [1,] "Foster Provost, Tom Fawcett" "Julia Silge, David Robinson"
## [2,] "Foster ProvostTom Fawcett"   "Julia SilgeDavid Robinson"

2. Between HTML and JSON

The two dataframes are identical.

all.equal(HTML_df,JSON_df)

## [1] TRUE

3. Between XML and JSON

The data frames built from the XML and JSON files are not exactly the same either. In the XML data frame, the <Author> values under <Authors> are concatenated without delimiters, whereas in the JSON data frame the values of the “Authors” array were joined with ‘, ’ as the delimiter by the str_c() call during loading.

all.equal(XML_df,JSON_df)

## [1] "Component \"Authors\": 3 string mismatches"

rbind(XML_df$Authors, JSON_df$Authors)

##      [,1]                                                           
## [1,] "Simon MunzertChristian RubbaPeter MeißnerDominic Nyhuis"      
## [2,] "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis"
##      [,2]                          [,3]                         
## [1,] "Foster ProvostTom Fawcett"   "Julia SilgeDavid Robinson"  
## [2,] "Foster Provost, Tom Fawcett" "Julia Silge, David Robinson"
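One detail of the comparisons above is worth spelling out: `all.equal()` returns `TRUE` when the objects match, and a character description of the differences when they do not, so it should be wrapped in `isTRUE()` whenever a plain logical answer is needed. The sketch below uses two toy one-row data frames (hypothetical names `html_like` and `xml_like`, with values copied from the Authors columns shown earlier) and only base R.

```r
# Toy data frames mirroring the Authors mismatch between the HTML and XML results
html_like <- data.frame(Title   = "Data Science for Business",
                        Authors = "Foster Provost, Tom Fawcett",
                        stringsAsFactors = FALSE)
xml_like  <- data.frame(Title   = "Data Science for Business",
                        Authors = "Foster ProvostTom Fawcett",
                        stringsAsFactors = FALSE)

cmp <- all.equal(html_like, xml_like)

is.character(cmp)   # TRUE: a description of the mismatch, not a logical value
isTRUE(cmp)         # FALSE: safe to use inside if()

# identical() is the strict test: same values, types, and attributes
identical(html_like, html_like)   # TRUE
identical(html_like, xml_like)    # FALSE
```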