Week 7 Assignment Description

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

Data - Three Books

Title | Author | Publisher | Year | Edition | ISBN
Automated Data Collection with R | Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis | John Wiley & Sons, Ltd | 2015 | 1st | 978-1-118-83481-7
Data Science for Business | Foster Provost, Tom Fawcett | O’Reilly Media, Inc | 2013 | 1st | 978-1-449-36132-7
Text Mining with R: A Tidy Approach | Julia Silge, David Robinson | O’Reilly Media, Inc | 2017 | 1st | 978-1-491-98165-8

Convert HTML to R Dataframe

  • The source file is shown below:
## <!DOCTYPE html>
## <html>
## <head><title>Three Books</title></head>
## <body>
##  <table>
## <tr>
## <th>Title</th>
##          <th>Authors</th>
##          <th>Publisher</th>
##          <th>Year</th>
##          <th>Edition</th>
##          <th>ISBN</th> 
##          </tr>
## <tr>
## <td>Automated Data Collection with R</td>
##          <td>Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis</td>
##          <td>John Wiley &amp; Sons, Ltd</td>
##          <td>2015</td>
##          <td>1st</td>
##          <td>978-1-118-83481-7</td>
##          </tr>
## <tr>
## <td>Data Science for Business</td>
##          <td>Foster Provost, Tom Fawcett</td>
##          <td>O’Reilly Media, Inc</td>
##          <td>2013</td>
##          <td>1st</td>
##          <td>978-1-449-36132-7</td>
##      </tr>
## <tr>
## <td>Text Mining with R: A Tidy Approach</td>
##          <td>Julia Silge, David Robinson</td>
##          <td>O’Reilly Media, Inc</td>
##          <td>2017</td>
##          <td>1st</td>
##          <td>978-1-491-98165-8</td>
##      </tr>
## </table>
## </body>
## </html>
## 
  • Load the HTML data into R as a data frame using the rvest package:
Title | Authors | Publisher | Year | Edition | ISBN
Automated Data Collection with R | Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis | John Wiley & Sons, Ltd | 2015 | 1st | 978-1-118-83481-7
Data Science for Business | Foster Provost, Tom Fawcett | O’Reilly Media, Inc | 2013 | 1st | 978-1-449-36132-7
Text Mining with R: A Tidy Approach | Julia Silge, David Robinson | O’Reilly Media, Inc | 2017 | 1st | 978-1-491-98165-8
## 'data.frame':    3 obs. of  6 variables:
##  $ Title    : chr  "Automated Data Collection with R" "Data Science for Business" "Text Mining with R: A Tidy Approach"
##  $ Authors  : chr  "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis" "Foster Provost, Tom Fawcett" "Julia Silge, David Robinson"
##  $ Publisher: chr  "John Wiley & Sons, Ltd" "O’Reilly Media, Inc" "O’Reilly Media, Inc"
##  $ Year     : int  2015 2013 2017
##  $ Edition  : chr  "1st" "1st" "1st"
##  $ ISBN     : chr  "978-1-118-83481-7" "978-1-449-36132-7" "978-1-491-98165-8"
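The loading code itself is not shown in this write-up, so the following is only a minimal sketch of how an HTML table like the one above could be read with rvest. The object names (`html_src`, `html_path`, `books_html`) and the temporary file are assumptions made for illustration, and the table is trimmed to two rows and four columns to keep the example short.

```r
library(rvest)

# A trimmed copy of the table above, written to a temporary file so the
# sketch is self-contained (the real books.html has three rows, six columns).
html_src <- '<!DOCTYPE html>
<html><body><table>
<tr><th>Title</th><th>Authors</th><th>Year</th><th>ISBN</th></tr>
<tr><td>Data Science for Business</td><td>Foster Provost, Tom Fawcett</td><td>2013</td><td>978-1-449-36132-7</td></tr>
<tr><td>Text Mining with R: A Tidy Approach</td><td>Julia Silge, David Robinson</td><td>2017</td><td>978-1-491-98165-8</td></tr>
</table></body></html>'
html_path <- tempfile(fileext = ".html")
writeLines(html_src, html_path)

# Parse the page, pick the first <table>, and convert it to a data frame.
# html_table() returns a tibble, so it is converted to a plain data.frame.
page <- read_html(html_path)
books_html <- as.data.frame(html_table(html_element(page, "table")))
str(books_html)
```

Because each cell already holds the full comma-separated author string, no post-processing of the Authors column is needed in this format.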

Convert XML to R Dataframe

  • The source file is shown below:
## <?xml version="1.0" encoding="UTF-8"?>
## <three_books>
##   <book id="1">
##     <Title>Automated Data Collection with R</Title>
##     <Authors>
##       <Author ID="1">Simon Munzert</Author>
##       <Author ID="2">Christian Rubba</Author>
##       <Author ID="3">Peter Meißner</Author>
##       <Author ID="4">Dominic Nyhuis</Author>
##     </Authors>
##     <Publisher>John Wiley &amp; Sons, Ltd</Publisher>
##     <Year>2015</Year>
##     <Edition>1st</Edition>
##     <ISBN>978-1-118-83481-7</ISBN>
##   </book>
##   <book id="2">
##     <Title>Data Science for Business</Title>
##     <Authors>
##       <Author ID="1">Foster Provost</Author>
##       <Author ID="2">Tom Fawcett</Author>
##     </Authors>
##     <Publisher>O’Reilly Media, Inc</Publisher>
##     <Year>2013</Year>
##     <Edition>1st</Edition>
##     <ISBN>978-1-449-36132-7</ISBN>
##   </book>
##   <book id="3">
##     <Title>Text Mining with R: A Tidy Approach</Title>
##     <Authors>
##       <Author ID="1">Julia Silge</Author>
##       <Author ID="2">David Robinson</Author>
##     </Authors>
##     <Publisher>O’Reilly Media, Inc</Publisher>
##     <Year>2017</Year>
##     <Edition>1st</Edition>
##     <ISBN>978-1-491-98165-8</ISBN>
##   </book>
## </three_books>
## 
  • Load the XML data into R as a data frame using the XML package:
Title | Authors | Publisher | Year | Edition | ISBN
Automated Data Collection with R | Simon MunzertChristian RubbaPeter MeißnerDominic Nyhuis | John Wiley & Sons, Ltd | 2015 | 1st | 978-1-118-83481-7
Data Science for Business | Foster ProvostTom Fawcett | O’Reilly Media, Inc | 2013 | 1st | 978-1-449-36132-7
Text Mining with R: A Tidy Approach | Julia SilgeDavid Robinson | O’Reilly Media, Inc | 2017 | 1st | 978-1-491-98165-8
## 'data.frame':    3 obs. of  6 variables:
##  $ Title    : chr  "Automated Data Collection with R" "Data Science for Business" "Text Mining with R: A Tidy Approach"
##  $ Authors  : chr  "Simon MunzertChristian RubbaPeter MeißnerDominic Nyhuis" "Foster ProvostTom Fawcett" "Julia SilgeDavid Robinson"
##  $ Publisher: chr  "John Wiley & Sons, Ltd" "O’Reilly Media, Inc" "O’Reilly Media, Inc"
##  $ Year     : int  2015 2013 2017
##  $ Edition  : chr  "1st" "1st" "1st"
##  $ ISBN     : chr  "978-1-118-83481-7" "978-1-449-36132-7" "978-1-491-98165-8"
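The result above is consistent with what `xmlToDataFrame()` from the XML package produces for nested elements, although the original loading code is not shown. The sketch below is an assumption about how it might look; `xml_src` and `books_xml` are illustrative names, and the document is trimmed to two books and three fields.

```r
library(XML)

# A trimmed copy of the XML source above, kept inline so the sketch is
# self-contained (the real books.xml has three books and six fields).
xml_src <- '<?xml version="1.0" encoding="UTF-8"?>
<three_books>
  <book id="2">
    <Title>Data Science for Business</Title>
    <Authors>
      <Author ID="1">Foster Provost</Author>
      <Author ID="2">Tom Fawcett</Author>
    </Authors>
    <Year>2013</Year>
  </book>
  <book id="3">
    <Title>Text Mining with R: A Tidy Approach</Title>
    <Authors>
      <Author ID="1">Julia Silge</Author>
      <Author ID="2">David Robinson</Author>
    </Authors>
    <Year>2017</Year>
  </book>
</three_books>'

doc <- xmlParse(xml_src, asText = TRUE)

# One row per <book>; each column holds the text value of a child element.
# For <Authors> that value is the text of all <Author> children run together.
books_xml <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# xmlToDataFrame() returns character columns, so Year is converted by hand.
books_xml$Year <- as.integer(books_xml$Year)
str(books_xml)
```

The explicit `as.integer()` call is one way to arrive at the `int` Year column shown in the output above; the original author may have converted it differently.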

Convert JSON to R Dataframe

  • The source file is shown below:
[Image: JSON source file]

  • Load the JSON data into R as a data frame using the jsonlite package:
Title | Authors | Publisher | Year | Edition | ISBN
Automated Data Collection with R | Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis | John Wiley & Sons, Ltd | 2015 | 1st | 978-1-118-83481-7
Data Science for Business | Foster Provost, Tom Fawcett | O’Reilly Media, Inc | 2013 | 1st | 978-1-449-36132-7
Text Mining with R: A Tidy Approach | Julia Silge, David Robinson | O’Reilly Media, Inc | 2017 | 1st | 978-1-491-98165-8
## 'data.frame':    3 obs. of  6 variables:
##  $ Title    : chr  "Automated Data Collection with R" "Data Science for Business" "Text Mining with R: A Tidy Approach"
##  $ Authors  : chr  "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis" "Foster Provost, Tom Fawcett" "Julia Silge, David Robinson"
##  $ Publisher: chr  "John Wiley & Sons, Ltd" "O’Reilly Media, Inc" "O’Reilly Media, Inc"
##  $ Year     : int  2015 2013 2017
##  $ Edition  : chr  "1st" "1st" "1st"
##  $ ISBN     : chr  "978-1-118-83481-7" "978-1-449-36132-7" "978-1-491-98165-8"
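The JSON source survives here only as an image, so its exact layout is unknown. The sketch below assumes a hypothetical layout (an array of book objects, with "Authors" stored as an array of names) purely to illustrate how jsonlite could yield a comma-separated Authors column; `json_src` and `books_json` are illustrative names.

```r
library(jsonlite)

# Hypothetical JSON layout (an assumption; the real books.json is not shown).
json_src <- '[
  {"Title": "Data Science for Business",
   "Authors": ["Foster Provost", "Tom Fawcett"],
   "Year": 2013},
  {"Title": "Text Mining with R: A Tidy Approach",
   "Authors": ["Julia Silge", "David Robinson"],
   "Year": 2017}
]'

# simplifyMatrix = FALSE keeps Authors as a list of character vectors
# instead of letting equal-length arrays be combined into a matrix.
books_json <- fromJSON(json_src, simplifyMatrix = FALSE)

# Collapse each author vector into a single comma-separated string.
books_json$Authors <- vapply(books_json$Authors, paste, character(1),
                             collapse = ", ")
str(books_json)
```

If the real file stores each author list as a single string instead of an array, the `vapply()` step would be unnecessary.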

Comparison

1. Between HTML and XML

The data frames built from the HTML file and the XML file are not identical. The cells of the HTML <table> element are parsed into the data frame exactly as written, whereas the <Author> children of each <Authors> element in the XML file are concatenated without a delimiter.

## [1] "Component \"Authors\": 3 string mismatches"
##      [,1]                                                           
## [1,] "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis"
## [2,] "Simon MunzertChristian RubbaPeter MeißnerDominic Nyhuis"      
##      [,2]                          [,3]                         
## [1,] "Foster Provost, Tom Fawcett" "Julia Silge, David Robinson"
## [2,] "Foster ProvostTom Fawcett"   "Julia SilgeDavid Robinson"
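The mismatch message and the side-by-side matrix above have the shape of output from base R's `all.equal()` and `rbind()`. The following is a small sketch of that kind of comparison, using two hand-built data frames (`df_html` and `df_xml` are illustrative names) that stand in for the parsed results.

```r
# Stand-ins for the parsed data frames, reduced to two books and two columns.
df_html <- data.frame(
  Title   = c("Data Science for Business", "Text Mining with R: A Tidy Approach"),
  Authors = c("Foster Provost, Tom Fawcett", "Julia Silge, David Robinson"),
  stringsAsFactors = FALSE
)
df_xml <- df_html
df_xml$Authors <- c("Foster ProvostTom Fawcett", "Julia SilgeDavid Robinson")

# all.equal() returns TRUE when the objects match, otherwise a character
# vector describing the differences, one entry per mismatching component.
cmp <- all.equal(df_html, df_xml)
print(cmp)

# Stack the two Authors columns to see the differing values side by side.
rbind(df_html$Authors, df_xml$Authors)
```

Since `all.equal()` does not return `FALSE` on a mismatch, wrapping it in `isTRUE()` is the usual way to use it in a condition.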

2. Between HTML and JSON

The two dataframes are identical.

## [1] TRUE

3. Between XML and JSON

The data frames built from the XML file and the JSON file are not identical either. The <Author> children of each XML <Authors> element are concatenated without a delimiter, whereas the values under the JSON “Authors” key are joined with a comma as the delimiter.

## [1] "Component \"Authors\": 3 string mismatches"
##      [,1]                                                           
## [1,] "Simon MunzertChristian RubbaPeter MeißnerDominic Nyhuis"      
## [2,] "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis"
##      [,2]                          [,3]                         
## [1,] "Foster ProvostTom Fawcett"   "Julia SilgeDavid Robinson"  
## [2,] "Foster Provost, Tom Fawcett" "Julia Silge, David Robinson"
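One possible way to make the XML result match the other two is to build the Authors column explicitly with XPath rather than relying on the concatenated text value. This is a sketch of that idea and not part of the original solution; `xml_src` and `authors` are illustrative names, and the document is trimmed to two books.

```r
library(XML)

# Trimmed copy of the XML source, inline so the sketch is self-contained.
xml_src <- '<three_books>
  <book id="2">
    <Authors>
      <Author ID="1">Foster Provost</Author>
      <Author ID="2">Tom Fawcett</Author>
    </Authors>
  </book>
  <book id="3">
    <Authors>
      <Author ID="1">Julia Silge</Author>
      <Author ID="2">David Robinson</Author>
    </Authors>
  </book>
</three_books>'

doc <- xmlParse(xml_src, asText = TRUE)

# For each <book>, collect its <Author> values and join them with ", ".
authors <- vapply(getNodeSet(doc, "//book"), function(book) {
  paste(xpathSApply(book, "./Authors/Author", xmlValue), collapse = ", ")
}, character(1))
authors
```

Assigning this vector to the Authors column of the XML data frame would give the same strings as the HTML and JSON versions.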