Assignment Instruction: Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
Loading packages
#install.packages("XML")
#install.packages("jsonlite")
#install.packages("RCurl")
library(XML)
library(jsonlite)
library(RCurl)
## Loading required package: bitops
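The getURL function from the RCurl package downloads the content of a URL as a character string. The readHTMLTable function from the XML package finds and reads the HTML tables in that content. Finally, the result is converted to a data frame and displayed. Because readHTMLTable returns a list of tables, wrapping that list in data.frame() prefixes every column name with the name of the list element, which is why the columns below start with NULL.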
html_url <- "https://raw.githubusercontent.com/blin261/DATA607/master/Week7Assignment/Books.html"
raw_data <- getURL(html_url)
book_html <- readHTMLTable(raw_data, header = TRUE, stringsAsFactors = FALSE)
book_html <- data.frame(book_html)
book_html
## NULL.Book.Title
## 1 Automated Data Collection with R
## 2 Data Science for Business
## 3 R for Everyone: Advanced Analytics and Graphics
## NULL.Author
## 1 Simon Munzert, Christian Rubba, Peter MeiÃ<U+0083>Â<U+009F>ner, Dominic Nyhuis
## 2 Foster Provost, Tom Fawcett
## 3 Jared P. Lander
## NULL.Publishers NULL.Publishing.Dates NULL.Pages
## 1 Wiley January 20, 2015 480
## 2 O'Reilly Media August 19, 2013 414
## 3 Addison-Wesley Professional December 29, 2013 464
## NULL.ISBN NULL.Languages
## 1 978-1118834817 English
## 2 978-1449361327 English
## 3 978-0321888037 English
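The NULL. prefix in the output above can be avoided by taking the table out of the list that readHTMLTable returns instead of passing the whole list to data.frame(). The following is a minimal sketch, not the code used for this assignment: the inline HTML string and the names html_text, tables, and book_tbl are hypothetical stand-ins for the real Books.html file, and the UTF-8 encoding passed to htmlParse is an assumption that may or may not resolve the garbled "ß" seen above.

```r
library(XML)

# Hypothetical two-row table standing in for Books.html (not the real file)
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>Title</th><th>Pages</th></tr>",
  "<tr><td>Automated Data Collection with R</td><td>480</td></tr>",
  "<tr><td>Data Science for Business</td><td>414</td></tr>",
  "</table></body></html>"
)

# Parse the string explicitly as text; the UTF-8 encoding is an assumption
doc <- htmlParse(html_text, asText = TRUE, encoding = "UTF-8")

# readHTMLTable returns a list with one element per <table> in the document
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# Extracting the first element keeps the table's own column names
book_tbl <- tables[[1]]
names(book_tbl)
str(book_tbl)
```

Selecting the list element by position works here because the sketch contains a single table; a page with several tables would need a more careful choice.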
The first step is the same as before: getURL downloads the file contents into R. The xmlParse function then parses the XML, and the xmlRoot() function extracts the top-level node, which holds the book records. Finally, xmlToDataFrame converts the child nodes of the root into a data frame.
xml_url <- "https://raw.githubusercontent.com/blin261/DATA607/master/Week7Assignment/Books.xml"
raw_data <- getURL(xml_url)
xml_data <- xmlParse(raw_data)
root <- xmlRoot(xml_data)
book_xml <- xmlToDataFrame(root)
book_xml
## Book_Title
## 1 Automated Data Collection with R
## 2 Data Science for Business
## 3 R for Everyone: Advanced Analytics and Graphics
## Author
## 1 Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis
## 2 Foster Provost, Tom Fawcett
## 3 Jared P. Lander
## Publishers Publishing_Dates Pages ISBN
## 1 Wiley January 20, 2015 480 978-1118834817
## 2 O'Reilly Media August 19, 2013 414 978-1449361327
## 3 Addison-Wesley Professional December 29, 2013 464 978-1449361327
## Languages
## 1 English
## 2 English
## 3 English
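If character columns are preferred over factors, xmlToDataFrame accepts a stringsAsFactors argument. The sketch below illustrates this on a small inline document; the XML string and the element names Textbooks and Book are assumptions made for illustration and are not taken from the real Books.xml file.

```r
library(XML)

# Hypothetical two-record document standing in for Books.xml (not the real file)
xml_text <- paste0(
  "<Textbooks>",
  "<Book><Book_Title>Automated Data Collection with R</Book_Title>",
  "<Pages>480</Pages></Book>",
  "<Book><Book_Title>Data Science for Business</Book_Title>",
  "<Pages>414</Pages></Book>",
  "</Textbooks>"
)

# Parse the string explicitly as text rather than as a file name
doc <- xmlParse(xml_text, asText = TRUE)

# stringsAsFactors = FALSE keeps every column as character
book_xml_chr <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Page counts still arrive as text; convert explicitly if numbers are needed
book_xml_chr$Pages <- as.integer(book_xml_chr$Pages)
str(book_xml_chr)
```

Converting Pages by hand is a design choice for this sketch; the colClasses argument of xmlToDataFrame is another route that is not shown here.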
The fromJSON function from jsonlite reads content in JSON format and converts it to R objects, which can then be easily transformed into a data frame.
json_url <- "https://raw.githubusercontent.com/blin261/DATA607/master/Week7Assignment/Books.json"
json_data <- fromJSON(json_url)
book_json <- data.frame(json_data)
book_json
## Textbooks.Book_Title
## 1 Automated Data Collection with R
## 2 Data Science for Business
## 3 R for Everyone: Advanced Analytics and Graphics
## Textbooks.Author
## 1 Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis
## 2 Foster Provost, Tom Fawcett
## 3 Jared P. Lander
## Textbooks.Publishers Textbooks.Publishing_Dates Textbooks.Pages
## 1 Wiley January 20, 2015 480
## 2 O'Reilly Media August 19, 2013 414
## 3 Addison-Wesley Professional December 29, 2013 464
## Textbooks.ISBN Textbooks.Languages
## 1 978-1118834817 English
## 2 978-1449361327 English
## 3 978-1449361327 English
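The Textbooks. prefix in the column names above comes from wrapping the whole parsed object in data.frame(). A minimal sketch of an alternative, assuming the JSON file has a single top-level key named Textbooks that holds an array of records (inferred from the prefix, not checked against the real Books.json), is to extract that element directly. The inline JSON string and the name book_json_plain are hypothetical.

```r
library(jsonlite)

# Hypothetical two-record document standing in for Books.json (not the real file)
json_text <- '{"Textbooks": [
  {"Book_Title": "Automated Data Collection with R", "Pages": "480"},
  {"Book_Title": "Data Science for Business", "Pages": "414"}
]}'

# fromJSON simplifies an array of records into a data frame by default
json_data <- fromJSON(json_text)

# Pulling out the element by name leaves the column names unprefixed
book_json_plain <- json_data$Textbooks
names(book_json_plain)
```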
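Conclusion: The three data frames are similar but not identical. In the data frame generated from the XML file, all the variables are factors, whereas in the other two they are all character vectors. The column names also differ: the HTML version carries a NULL. prefix and the JSON version a Textbooks. prefix. In addition, the HTML version shows the "ß" in "Meißner" garbled and lists a different ISBN for the third book (978-0321888037 rather than 978-1449361327). JSON looks more programming-friendly because the data are structured like JavaScript objects. According to the textbook (“Automated Data Collection with R”), JSON is compatible with JavaScript and can be directly parsed into JavaScript objects.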
str(book_html)
## 'data.frame': 3 obs. of 7 variables:
## $ NULL.Book.Title : chr "Automated Data Collection with R" "Data Science for Business" "R for Everyone: Advanced Analytics and Graphics"
## $ NULL.Author : chr "Simon Munzert, Christian Rubba, Peter MeiÃ<U+0083>Â<U+009F>ner, Dominic Nyhuis" "Foster Provost, Tom Fawcett" "Jared P. Lander"
## $ NULL.Publishers : chr "Wiley" "O'Reilly Media" "Addison-Wesley Professional"
## $ NULL.Publishing.Dates: chr "January 20, 2015" "August 19, 2013" "December 29, 2013"
## $ NULL.Pages : chr "480" "414" "464"
## $ NULL.ISBN : chr "978-1118834817" "978-1449361327" "978-0321888037"
## $ NULL.Languages : chr "English" "English" "English"
str(book_xml)
## 'data.frame': 3 obs. of 7 variables:
## $ Book_Title : Factor w/ 3 levels "Automated Data Collection with R",..: 1 2 3
## $ Author : Factor w/ 3 levels "Foster Provost, Tom Fawcett",..: 3 1 2
## $ Publishers : Factor w/ 3 levels "Addison-Wesley Professional",..: 3 2 1
## $ Publishing_Dates: Factor w/ 3 levels "August 19, 2013",..: 3 1 2
## $ Pages : Factor w/ 3 levels "414","464","480": 3 1 2
## $ ISBN : Factor w/ 2 levels "978-1118834817",..: 1 2 2
## $ Languages : Factor w/ 1 level "English": 1 1 1
str(book_json)
## 'data.frame': 3 obs. of 7 variables:
## $ Textbooks.Book_Title : chr "Automated Data Collection with R" "Data Science for Business" "R for Everyone: Advanced Analytics and Graphics"
## $ Textbooks.Author : chr "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis" "Foster Provost, Tom Fawcett" "Jared P. Lander"
## $ Textbooks.Publishers : chr "Wiley" "O'Reilly Media" "Addison-Wesley Professional"
## $ Textbooks.Publishing_Dates: chr "January 20, 2015" "August 19, 2013" "December 29, 2013"
## $ Textbooks.Pages : chr "480" "414" "464"
## $ Textbooks.ISBN : chr "978-1118834817" "978-1449361327" "978-1449361327"
## $ Textbooks.Languages : chr "English" "English" "English"
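The question of whether the data frames are identical can also be checked programmatically with identical(). The sketch below uses small made-up data frames that mimic the naming and typing differences seen in the str() output; the helper harmonize and the toy data are hypothetical and only illustrate the idea of aligning column names and types before comparing.

```r
# Toy stand-ins for the three data frames (made-up rows, not the real files)
from_html <- data.frame(NULL.Book.Title = c("A", "B"),
                        NULL.Pages = c("480", "414"),
                        stringsAsFactors = FALSE)
from_xml <- data.frame(Book_Title = c("A", "B"),
                       Pages = c("480", "414"),
                       stringsAsFactors = TRUE)
from_json <- data.frame(Textbooks.Book_Title = c("A", "B"),
                        Textbooks.Pages = c("480", "414"),
                        stringsAsFactors = FALSE)

# Same values and types, but different column names, so not identical
identical(from_html, from_json)

# Hypothetical helper: force character columns and a common set of names
harmonize <- function(df, col_names) {
  df[] <- lapply(df, as.character)
  names(df) <- col_names
  df
}

common_names <- c("Book_Title", "Pages")
h <- harmonize(from_html, common_names)
x <- harmonize(from_xml, common_names)
j <- harmonize(from_json, common_names)

# After harmonizing, only genuine differences in the values remain
identical(h, x)
identical(h, j)
```

Applied to the real data, a comparison like this would still flag value-level differences such as a differing ISBN or a mis-encoded character.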