Description of the Assignment

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and create three separate files that store the books’ information in HTML (using an HTML table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”).

Libraries

library(RCurl)
library(htmltab)
library(XML)
library(jsonlite)
library(knitr)

Transforming to data frames

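# htmlfile, XMLfile and jsonfile are assumed to hold the locations of the three book files; they are defined outside the code shown here.
html_df <- data.frame(htmltab(htmlfile))    # HTML table -> data frame
XML_df <- xmlToDataFrame(XMLfile)           # XML records -> data frame
json_df <- data.frame(fromJSON(jsonfile))   # nested JSON list -> data frame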
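
One option worth knowing at this step: xmlToDataFrame accepts a stringsAsFactors argument, which controls whether text columns come back as factors or as character vectors. The sketch below is a minimal illustration on an inline XML string. The two-book document, its BOOKS/BOOK layout, and the names xml_text and xml_sketch are assumptions made up for the example; they stand in for the real books.xml, which is not shown here.

```r
library(XML)

# Hypothetical stand-in for books.xml: assumed BOOKS/BOOK layout, two fields only.
xml_text <- paste0(
  "<BOOKS>",
  "<BOOK><Title>R for Everyone</Title><Pages>435</Pages></BOOK>",
  "<BOOK><Title>Data Science for Business</Title><Pages>369</Pages></BOOK>",
  "</BOOKS>"
)

# Parse the string explicitly as XML text rather than as a file path.
doc <- xmlParse(xml_text, asText = TRUE)

# Ask for character columns instead of factors.
xml_sketch <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

str(xml_sketch)
```

Even with this option, numeric fields such as Pages are still read as text and need an explicit conversion if numbers are wanted.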

Outputs

kable(html_df)
|   | Title | Author | Date | Publisher | Pages | Material | Condition |
|---|-------|--------|------|-----------|-------|----------|-----------|
| 2 | Automated Data Collection with R | Simon Munzert, Christian Rubba, Peter Meibner, Dominic Nyhuis | 2015 | Wiley | 449 | Hard-bound | New |
| 3 | R for Everyone | Jared P. Lander | 2014 | Addison-Wesley | 435 | Paperback | Used |
| 4 | Data Science for Business | Foster Provost, Tom Fawcett | 2013 | O’Reilly | 369 | Paperback | New |

kable(XML_df)
| Title | Author | Date | Publisher | Pages | Material | Condition |
|-------|--------|------|-----------|-------|----------|-----------|
| Automated Data Collection with R | Simon Munzert, Christian Rubba, Peter Meibner, Dominic Nyhuis | 2015 | Wiley | 449 | Hard-bound | New |
| R for Everyone | Jared P. Lander | 2014 | Addison-Wesley | 435 | Paperback | Used |
| Data Science for Business | Foster Provost, Tom Fawcett | 2013 | O’Reilly | 369 | Paperback | New |

kable(json_df)
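| BOOKS.BOOK.Title | BOOKS.BOOK.Author | BOOKS.BOOK.Date | BOOKS.BOOK.Publisher | BOOKS.BOOK.Pages | BOOKS.BOOK.Material | BOOKS.BOOK.Condition |
|------------------|-------------------|-----------------|----------------------|------------------|---------------------|----------------------|
| Automated Data Collection with R | Simon Munzert, Christian Rubba, Peter Meibner, Dominic Nyhuis | 2015 | Wiley | 449 | Hard-bound | New |
| R for Everyone | Jared P. Lander | 2014 | Addison-Wesley | 435 | Paperback | Used |
| Data Science for Business | Foster Provost, Tom Fawcett | 2013 | O’Reilly | 369 | Paperback | New |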

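When printed, the three data frames look almost identical, apart from the row names in the HTML version and the BOOKS.BOOK. prefix on the JSON column names. The structures below show that they are not: the HTML import stores every column as character, the XML import stores every column as a factor, and the JSON import stores Date and Pages as integers and the remaining columns as character.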
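
The prefixed JSON column names come from wrapping the whole nested list in data.frame(). A minimal sketch of the difference follows, using an inline JSON string. The layout (a BOOKS object holding a BOOK array) is an assumption inferred from the column names, and json_text, parsed, prefixed, and plain are hypothetical names used only for illustration.

```r
library(jsonlite)

# Hypothetical stand-in for books.json: assumed BOOKS -> BOOK nesting, two fields only.
json_text <- '{"BOOKS": {"BOOK": [
  {"Title": "R for Everyone", "Pages": 435},
  {"Title": "Data Science for Business", "Pages": 369}
]}}'

parsed <- fromJSON(json_text)

# Wrapping the whole nested list carries the nesting into the column names.
prefixed <- data.frame(parsed)

# Selecting the inner element returns the data frame with its plain names.
plain <- parsed$BOOKS$BOOK

names(prefixed)
names(plain)
```

Selecting the inner element keeps the column names comparable with the HTML and XML versions without any renaming step.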

Structures

str(html_df)
## 'data.frame':    3 obs. of  7 variables:
##  $ Title    : chr  "Automated Data Collection with R" "R for Everyone" "Data Science for Business"
##  $ Author   : chr  "Simon Munzert, Christian Rubba, Peter Meibner, Dominic Nyhuis" "Jared P. Lander" "Foster Provost, Tom Fawcett"
##  $ Date     : chr  "2015" "2014" "2013"
##  $ Publisher: chr  "Wiley" "Addison-Wesley" "O'Reilly"
##  $ Pages    : chr  "449" "435" "369"
##  $ Material : chr  "Hard-bound" "Paperback" "Paperback"
##  $ Condition: chr  "New" "Used" "New"
str(XML_df)
## 'data.frame':    3 obs. of  7 variables:
##  $ Title    : Factor w/ 3 levels "Automated Data Collection with R",..: 1 3 2
##  $ Author   : Factor w/ 3 levels "Foster Provost, Tom Fawcett",..: 3 2 1
##  $ Date     : Factor w/ 3 levels "2013","2014",..: 3 2 1
##  $ Publisher: Factor w/ 3 levels "Addison-Wesley",..: 3 1 2
##  $ Pages    : Factor w/ 3 levels "369","435","449": 3 2 1
##  $ Material : Factor w/ 2 levels "Hard-bound","Paperback": 1 2 2
##  $ Condition: Factor w/ 2 levels "New","Used": 1 2 1
str(json_df)
## 'data.frame':    3 obs. of  7 variables:
##  $ BOOKS.BOOK.Title    : chr  "Automated Data Collection with R" "R for Everyone" "Data Science for Business"
##  $ BOOKS.BOOK.Author   : chr  "Simon Munzert, Christian Rubba, Peter Meibner, Dominic Nyhuis" "Jared P. Lander" "Foster Provost, Tom Fawcett"
##  $ BOOKS.BOOK.Date     : int  2015 2014 2013
##  $ BOOKS.BOOK.Publisher: chr  "Wiley" "Addison-Wesley" "O'Reilly"
##  $ BOOKS.BOOK.Pages    : int  449 435 369
##  $ BOOKS.BOOK.Material : chr  "Hard-bound" "Paperback" "Paperback"
##  $ BOOKS.BOOK.Condition: chr  "New" "Used" "New"
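
The differences shown above can be removed by converting each data frame to a common set of column types. The following is a minimal sketch in base R, not the method used in this report. The toy data frames html_like, xml_like, and json_like and the helper normalize_books are hypothetical; they imitate the three structures with two columns each.

```r
# Toy stand-ins for the three imports (hypothetical, two columns only).
html_like <- data.frame(Title = c("A", "B"), Pages = c("449", "435"),
                        stringsAsFactors = FALSE)            # all character
xml_like  <- data.frame(Title = factor(c("A", "B")),
                        Pages = factor(c("449", "435")))     # all factor
json_like <- data.frame(Title = c("A", "B"), Pages = c(449L, 435L),
                        stringsAsFactors = FALSE)            # integer Pages

# Hypothetical helper: bring any of the three shapes to one common structure.
normalize_books <- function(df, int_cols = "Pages") {
  # Factors become character first; calling as.integer() on a factor
  # directly would return the level codes, not the page counts.
  df[] <- lapply(df, function(col) if (is.factor(col)) as.character(col) else col)
  # Selected columns become integer.
  df[int_cols] <- lapply(df[int_cols], as.integer)
  # Reset row names so leftovers from the import do not matter.
  rownames(df) <- NULL
  df
}

html_norm <- normalize_books(html_like)
xml_norm  <- normalize_books(xml_like)
json_norm <- normalize_books(json_like)

all_equal <- identical(html_norm, xml_norm) && identical(xml_norm, json_norm)
```

For the real data, the same idea would apply with int_cols = c("Date", "Pages"), after the JSON column names have been brought in line with the other two.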