Assignment Instruction: Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Loading packages

#install.packages("XML")
#install.packages("jsonlite")
#install.packages("RCurl")

library(XML)
library(jsonlite)
library(RCurl)
## Loading required package: bitops

The getURL() function from the RCurl package retrieves the content of a URL. The readHTMLTable() function from the XML package finds and reads the HTML tables in that content. Finally, we convert the result to a data frame and display it.

html_url <- "https://raw.githubusercontent.com/blin261/DATA607/master/Week7Assignment/Books.html"
raw_data <- getURL(html_url)
book_html <- readHTMLTable(raw_data, header = TRUE, stringsAsFactors = FALSE) 
book_html <- data.frame(book_html)
book_html
##                                   NULL.Book.Title
## 1                Automated Data Collection with R
## 2                       Data Science for Business
## 3 R for Everyone: Advanced Analytics and Graphics
##                                                        NULL.Author
## 1 Simon Munzert, Christian Rubba, Peter MeiÃ<U+0083>Â<U+009F>ner, Dominic Nyhuis
## 2                                      Foster Provost, Tom Fawcett
## 3                                                  Jared P. Lander
##               NULL.Publishers NULL.Publishing.Dates NULL.Pages
## 1                       Wiley      January 20, 2015        480
## 2              O'Reilly Media       August 19, 2013        414
## 3 Addison-Wesley Professional     December 29, 2013        464
##        NULL.ISBN NULL.Languages
## 1 978-1118834817        English
## 2 978-1449361327        English
## 3 978-0321888037        English
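The NULL. prefix on every column most likely appears because readHTMLTable() returns a list of tables, the table has no id so its list name is NULL, and data.frame() then prepends that name. A minimal sketch of one way around this, assuming a small inline stand-in for Books.html (the html_text string and the book_html_alt name are made up for illustration): the which argument selects the first table directly, and declaring the encoding in htmlParse() may also help with the garbled "Meißner" seen above.

```r
library(XML)

# Hypothetical inline stand-in for Books.html, so the sketch runs offline.
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>Book Title</th><th>Pages</th></tr>",
  "<tr><td>Data Science for Business</td><td>414</td></tr>",
  "<tr><td>R for Everyone: Advanced Analytics and Graphics</td><td>464</td></tr>",
  "</table></body></html>"
)

# Parse the string as HTML and state the encoding explicitly.
doc <- htmlParse(html_text, asText = TRUE, encoding = "UTF-8")

# which = 1 returns the first table as a data frame rather than a list of
# tables, so no list name is prepended to the column names.
book_html_alt <- readHTMLTable(doc, which = 1, header = TRUE,
                               stringsAsFactors = FALSE)
names(book_html_alt)
```

With the real file, the same two calls would take raw_data in place of html_text.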

The first step is the same as before: we retrieve the data from the URL into R. The xmlParse() function parses the XML file, and xmlRoot() extracts the top-level node. The resulting variable stores the root node of the books document. xmlToDataFrame() then converts the root's child nodes into a data frame.

xml_url <- "https://raw.githubusercontent.com/blin261/DATA607/master/Week7Assignment/Books.xml"
raw_data <- getURL(xml_url)
xml_data <- xmlParse(raw_data)
root <- xmlRoot(xml_data)
book_xml <- xmlToDataFrame(root)
book_xml
##                                        Book_Title
## 1                Automated Data Collection with R
## 2                       Data Science for Business
## 3 R for Everyone: Advanced Analytics and Graphics
##                                                          Author
## 1 Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis
## 2                                   Foster Provost, Tom Fawcett
## 3                                               Jared P. Lander
##                    Publishers  Publishing_Dates Pages           ISBN
## 1                       Wiley  January 20, 2015   480 978-1118834817
## 2              O'Reilly Media   August 19, 2013   414 978-1449361327
## 3 Addison-Wesley Professional December 29, 2013   464 978-1449361327
##   Languages
## 1   English
## 2   English
## 3   English
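xmlToDataFrame() accepts a stringsAsFactors argument, so the columns can be read as characters instead of factors. The following is only a sketch on an inline stand-in for Books.xml: the root element name Textbooks and the record element name Book are assumptions, since the layout of the real file is not shown here.

```r
library(XML)

# Hypothetical inline stand-in for Books.xml (element names are assumed).
xml_text <- paste0(
  "<Textbooks>",
  "<Book><Book_Title>Data Science for Business</Book_Title>",
  "<Pages>414</Pages></Book>",
  "<Book><Book_Title>R for Everyone: Advanced Analytics and Graphics</Book_Title>",
  "<Pages>464</Pages></Book>",
  "</Textbooks>"
)

doc <- xmlParse(xml_text, asText = TRUE)

# Keep text columns as character vectors rather than factors.
book_xml_alt <- xmlToDataFrame(xmlRoot(doc), stringsAsFactors = FALSE)

# Numeric fields still arrive as text, so convert them explicitly.
book_xml_alt$Pages <- as.integer(book_xml_alt$Pages)
str(book_xml_alt)
```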

The fromJSON() function from jsonlite reads content in JSON format and converts it to R objects, which can then easily be transformed into a data frame.

json_url <- "https://raw.githubusercontent.com/blin261/DATA607/master/Week7Assignment/Books.json"
json_data <- fromJSON(json_url)
book_json <- data.frame(json_data)
book_json
##                              Textbooks.Book_Title
## 1                Automated Data Collection with R
## 2                       Data Science for Business
## 3 R for Everyone: Advanced Analytics and Graphics
##                                                Textbooks.Author
## 1 Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis
## 2                                   Foster Provost, Tom Fawcett
## 3                                               Jared P. Lander
##          Textbooks.Publishers Textbooks.Publishing_Dates Textbooks.Pages
## 1                       Wiley           January 20, 2015             480
## 2              O'Reilly Media            August 19, 2013             414
## 3 Addison-Wesley Professional          December 29, 2013             464
##   Textbooks.ISBN Textbooks.Languages
## 1 978-1118834817             English
## 2 978-1449361327             English
## 3 978-1449361327             English
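The Textbooks. prefix suggests that the JSON file wraps the records in a top-level "Textbooks" key, which data.frame() then prepends to each column name. A minimal sketch, assuming that layout and using an inline stand-in for Books.json (json_text and book_json_alt are illustrative names): extracting the element directly keeps the original field names.

```r
library(jsonlite)

# Hypothetical inline stand-in for Books.json (layout is assumed).
json_text <- '{"Textbooks": [
  {"Book_Title": "Data Science for Business", "Pages": "414"},
  {"Book_Title": "R for Everyone: Advanced Analytics and Graphics", "Pages": "464"}
]}'

# fromJSON() simplifies an array of records into a data frame by default.
json_data <- fromJSON(json_text)

# Pull the data frame out of the list instead of wrapping it in data.frame().
book_json_alt <- json_data$Textbooks
names(book_json_alt)
```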

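Conclusion: The three data frames are not identical. The column names differ: the HTML data frame prefixes every name with "NULL." and uses dots where the others use underscores, and the JSON data frame prefixes every name with "Textbooks.". The data types differ as well: in the data frame generated from XML all the variables are factors, whereas in the other two they are all characters. Some values also differ: the HTML data frame shows "Meißner" with a garbled encoding, and the third ISBN is 978-0321888037 in the HTML data frame but 978-1449361327 in the XML and JSON data frames. JSON looks more programming-friendly, because the data are structured like a JavaScript object. According to the textbook (“Automated Data Collection with R”), JSON is compatible with JavaScript and can be parsed directly into JavaScript objects.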

str(book_html)
## 'data.frame':    3 obs. of  7 variables:
##  $ NULL.Book.Title      : chr  "Automated Data Collection with R" "Data Science for Business" "R for Everyone: Advanced Analytics and Graphics"
##  $ NULL.Author          : chr  "Simon Munzert, Christian Rubba, Peter MeiÃ<U+0083>Â<U+009F>ner, Dominic Nyhuis" "Foster Provost, Tom Fawcett" "Jared P. Lander"
##  $ NULL.Publishers      : chr  "Wiley" "O'Reilly Media" "Addison-Wesley Professional"
##  $ NULL.Publishing.Dates: chr  "January 20, 2015" "August 19, 2013" "December 29, 2013"
##  $ NULL.Pages           : chr  "480" "414" "464"
##  $ NULL.ISBN            : chr  "978-1118834817" "978-1449361327" "978-0321888037"
##  $ NULL.Languages       : chr  "English" "English" "English"
str(book_xml)
## 'data.frame':    3 obs. of  7 variables:
##  $ Book_Title      : Factor w/ 3 levels "Automated Data Collection with R",..: 1 2 3
##  $ Author          : Factor w/ 3 levels "Foster Provost, Tom Fawcett",..: 3 1 2
##  $ Publishers      : Factor w/ 3 levels "Addison-Wesley Professional",..: 3 2 1
##  $ Publishing_Dates: Factor w/ 3 levels "August 19, 2013",..: 3 1 2
##  $ Pages           : Factor w/ 3 levels "414","464","480": 3 1 2
##  $ ISBN            : Factor w/ 2 levels "978-1118834817",..: 1 2 2
##  $ Languages       : Factor w/ 1 level "English": 1 1 1
str(book_json)
## 'data.frame':    3 obs. of  7 variables:
##  $ Textbooks.Book_Title      : chr  "Automated Data Collection with R" "Data Science for Business" "R for Everyone: Advanced Analytics and Graphics"
##  $ Textbooks.Author          : chr  "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis" "Foster Provost, Tom Fawcett" "Jared P. Lander"
##  $ Textbooks.Publishers      : chr  "Wiley" "O'Reilly Media" "Addison-Wesley Professional"
##  $ Textbooks.Publishing_Dates: chr  "January 20, 2015" "August 19, 2013" "December 29, 2013"
##  $ Textbooks.Pages           : chr  "480" "414" "464"
##  $ Textbooks.ISBN            : chr  "978-1118834817" "978-1449361327" "978-1449361327"
##  $ Textbooks.Languages       : chr  "English" "English" "English"
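The comparison can also be made programmatically rather than by reading the str() output. The sketch below uses base R only and two small stand-in data frames that mimic the HTML and XML results above (two books, two columns); the normalize() helper and the _demo names are made up for illustration. It strips the prefixes, converts factors to characters, and then reports which cells still differ.

```r
# Stand-ins that mimic the structure of book_html and book_xml above.
book_html_demo <- data.frame(
  NULL.Book.Title = c("Data Science for Business",
                      "R for Everyone: Advanced Analytics and Graphics"),
  NULL.ISBN = c("978-1449361327", "978-0321888037"),
  stringsAsFactors = FALSE
)
book_xml_demo <- data.frame(
  Book_Title = c("Data Science for Business",
                 "R for Everyone: Advanced Analytics and Graphics"),
  ISBN = c("978-1449361327", "978-1449361327"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: drop the "NULL." / "Textbooks." prefix, replace dots
# with underscores, and turn every column into a character vector.
normalize <- function(df) {
  stripped <- sub("^(NULL|Textbooks)\\.", "", names(df))
  names(df) <- gsub(".", "_", stripped, fixed = TRUE)
  df[] <- lapply(df, as.character)
  df
}

html_n <- normalize(book_html_demo)
xml_n <- normalize(book_xml_demo)

# identical() is strict: names, types, and values must all match.
same <- identical(html_n, xml_n)

# Locate the cells that differ once names and types are aligned.
diff_cells <- which(html_n != xml_n, arr.ind = TRUE)
same
diff_cells
```

After normalization the only remaining difference in this stand-in is the ISBN in the second row, which corresponds to the third book in the full data.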