Overview

  1. Pick three of your favorite books on one of your favorite subjects.
    • At least one of the books should have more than one author.
  2. For each book, include the title, authors, and two or three other attributes that you find interesting.
  3. Take the information that you’ve selected about these three books, and separately create three files which store the books’ information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”).
  4. To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
  5. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames.
    • Are the three data frames identical?
  6. Your deliverable is the three source files and the R code.
    • If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web.]

Import Data

The files book.html, book.xml, and book.json are located in the Week 7 folder of my GitHub repository.

Each URL was stored in its own character variable:

  • url_HTML
  • urlXML
  • url_JSON

Importing involves the following:

  • HTML: read_html() from the rvest library; the result is of class xml_document.
  • XML: xmlParse() from the XML library; the result is of class XMLInternalDocument.
  • JSON: fromJSON() from the jsonlite library; the result is a list.
library(rvest)
library(XML)
library(jsonlite)
import_html <- read_html(url_HTML)  # header is an html_table() argument, not a read_html() one
import_xml <- xmlParse(urlXML)
import_json <- jsonlite::fromJSON(url_JSON)
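
The class of each imported object can be checked without network access. The sketch below is only an illustration: the inline strings are made-up stand-ins for the real files (their structure is an assumption, not a copy of book.html, book.xml, or book.json), and the doc_* names are hypothetical.

```r
# Hypothetical inline stand-ins for the three source files.
html_txt <- "<table><tr><td>Title</td><td>Year</td></tr><tr><td>Bayesian Theory</td><td>2000</td></tr></table>"
xml_txt  <- "<books><book><title>Bayesian Theory</title><year>2000</year></book></books>"
json_txt <- '{"book1": {"Title": "Bayesian Theory", "Year": 2000}}'

# Parse each string with the same functions used for the URLs above.
doc_html <- rvest::read_html(html_txt)               # literal HTML is accepted as input
doc_xml  <- XML::xmlParse(xml_txt, asText = TRUE)    # asText = TRUE: parse a string, not a file
doc_json <- jsonlite::fromJSON(json_txt)

# Inspect what each parser returned.
class(doc_html)
class(doc_xml)
class(doc_json)
```

Namespace prefixes (rvest::, XML::, jsonlite::) are used here so that the example does not depend on the order in which the packages are attached.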

Convert to Data Frame

  • HTML: first converted with the html_table() function, which returns a list containing the table as a tibble; the result is then converted to a traditional data.frame with as.data.frame(). (See notes 1 and 2.)
  • XML: converted with the xmlToDataFrame() function from the XML package. (See note 3.)
  • JSON: lapply() applies data.frame() to each element of the imported list, turning every book into a one-row data frame; the base function do.call() then passes those rows to rbind(), which stacks them into the desired data.frame.
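library(janitor)   # row_to_names()
library(stringr)   # str_to_title()

# HTML: list of tables -> data.frame, then promote the first row to column names
df_html <- as.data.frame(html_table(import_html)) %>%
  row_to_names(1) %>%
  tibble::remove_rownames()

# XML: one row per book node; capitalize the column names
df_xml <- xmlToDataFrame(import_xml)
colnames(df_xml) <- str_to_title(colnames(df_xml))

# JSON: one data frame per list element, stacked row-wise
df_json <- do.call("rbind", lapply(import_json, data.frame))
rownames(df_json) <- NULL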
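
The JSON step can be tried in isolation with base R only. This is a minimal sketch, assuming the imported list has one named element per book; the books list below is a hypothetical, shortened stand-in for import_json.

```r
# Hypothetical stand-in for the list returned by fromJSON(): one element per book.
books <- list(
  book1 = list(BookName = "Bayesian Theory", YearPublished = 2000, Price = 98.03),
  book2 = list(BookName = "The Bayesian Choice", YearPublished = 2007, Price = 43.99)
)

# lapply() turns each book into a one-row data frame.
rows <- lapply(books, data.frame)

# do.call() calls rbind() with every element of `rows` as an argument,
# stacking the one-row data frames into a single data frame.
df <- do.call("rbind", rows)

# Reset the row names to the default 1, 2, ... sequence.
rownames(df) <- NULL

df
```

Using do.call() here avoids writing rbind(rows[[1]], rows[[2]], ...) by hand, so the same line works for any number of books.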
HTML
Book Title | Author(s) | Year Published | Publisher | Price ($)
--- | --- | --- | --- | ---
Bayesian Theory 1st Edition | Jose M. Bernardo, Adrian F. M. Smith | 2000 | Wiley Series in Probability and Statistics | 98.03
The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation 2nd Edition | Christian P. Robert | 2007 | Springer Verlag | 43.99
A First Course in Bayesian Statistical Methods 1st Edition | Peter D. Hoff | 2010 | Springer Verlag | 46.60
XML
Title | Author | Year_published | Publisher | Price
--- | --- | --- | --- | ---
Bayesian Theory 1st Edition | Jose M. Bernardo, Adrian F. M. Smith | 2000 | Wiley Series in Probability and Statistics | 98.03
The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation 2nd Edition | Christian P. Robert | 2007 | Springer Verlag | 43.99
A First Course in Bayesian Statistical Methods 1st Edition | Peter D. Hoff | 2010 | Springer Verlag | 46.60
JSON
BookName | Author | YearPublished | Publisher | Price
--- | --- | --- | --- | ---
Bayesian Theory 1st Edition | Jose M. Bernardo, Adrian F. M. Smith | 2000 | Wiley Series in Probability and Statistics | 98.03
The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation 2nd Edition | Christian P. Robert | 2007 | Springer Verlag | 43.99
A First Course in Bayesian Statistical Methods 1st Edition | Peter D. Hoff | 2010 | Springer Verlag | 46.60

Conclusion

Are the three data frames identical?

No, they are not. The column names follow the naming conventions of the source file each data frame was imported from (although they can be excluded or altered on import). Each format requires its own library to import, and the class of the imported object is different in each case. As a result, the approach to converting the data into a data.frame also differs.
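
One way to make this comparison explicit is with identical() and all.equal(). The sketch below uses two hypothetical one-row data frames (df_a and df_b are illustrative names, and the character versus numeric Price column is an assumption about how the imports might differ, not a result taken from the files above).

```r
# Two hypothetical data frames holding the same book, with different
# column names and a Price column stored as character in one of them.
df_a <- data.frame(Title = "Bayesian Theory", Price = "98.03",
                   stringsAsFactors = FALSE)
df_b <- data.frame(BookName = "Bayesian Theory", Price = 98.03,
                   stringsAsFactors = FALSE)

# The column names and the Price type differ, so a strict comparison fails.
identical(df_a, df_b)  # FALSE

# Harmonize the column names and the column types.
names(df_b) <- names(df_a)
df_a$Price <- as.numeric(df_a$Price)

# all.equal() compares values with a small numeric tolerance.
same_after_cleanup <- isTRUE(all.equal(df_a, df_b))
same_after_cleanup
```

The same pattern could be applied pairwise to df_html, df_xml, and df_json once their column names have been aligned.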


  1. When examined with the class() function, the result of html_table(import_html) is of class list.

  2. The row_to_names() function is used to replace the column names with the values in the first row.

  3. The str_to_title() function is used to capitalize the first letter of each column name.