Overview

Pick three of your favorite books on one of your favorite subjects.
- At least one of the books should have more than one author.
For each book, include the title, authors, and two or three other attributes that you find interesting.
Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”).
To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames.
- Are the three data frames identical?
Your deliverable is the three source files and the R code.
- If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

Import Data

The files book.html, book.xml, and book.json are located in the Week 7 folder of my GitHub repository.

All URLs were stored in their respective character variables:

url_HTML
urlXML
url_JSON

Importing involves the following:

HTML: read_html() from the rvest library; the result is of class xml_document
XML: xmlParse() from the XML library; the result is of class XMLInternalDocument
JSON: fromJSON() from the jsonlite library; the result is a list

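library(rvest)     # read_html(), html_table()
library(XML)       # xmlParse(), xmlToDataFrame()
library(jsonlite)  # fromJSON()
import_html <- read_html(url_HTML)  # header is an html_table() option, not a read_html() one
import_xml  <- xmlParse(urlXML)
import_json <- jsonlite::fromJSON(url_JSON)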
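
The classes listed above can be checked without any network access. The following is a minimal sketch that parses small inline strings instead of the GitHub URLs; the strings html_txt, xml_txt, and json_txt and the doc_* names are hypothetical stand-ins, and the layout of the JSON (a single "books" array) is an assumption about the source file, not something the assignment confirms.

```r
library(rvest)     # read_html()
library(XML)       # xmlParse()
library(jsonlite)  # fromJSON()

# Hypothetical inline stand-ins for the three source files
html_txt <- "<table><tr><th>Title</th><th>Year</th></tr><tr><td>Bayesian Theory</td><td>2000</td></tr></table>"
xml_txt  <- "<books><book><title>Bayesian Theory</title><year>2000</year></book></books>"
json_txt <- '{"books":[{"Title":"Bayesian Theory","Year":2000}]}'

doc_html <- rvest::read_html(html_txt)             # literal markup is parsed directly
doc_xml  <- XML::xmlParse(xml_txt, asText = TRUE)  # asText = TRUE: the input is a string, not a file or URL
doc_json <- jsonlite::fromJSON(json_txt)

# Inspect the class of each imported object
class(doc_html)
class(doc_xml)
class(doc_json)
```

Each parser hands back its own kind of object, which is why the conversion step differs for each format.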

Convert to Data Frame

HTML: converted first with the html_table() function; the result is then converted to a traditional data.frame with the function as.data.frame(). NOTE¹,²
XML: converted with the xmlToDataFrame() function from the XML package. NOTE³
JSON: the base function do.call() calls rbind() with the elements of a list as its arguments. lapply() applies a function to every element of a list, in this case data.frame(), which formats each element into rows and columns. Combined, the two produce the desired data.frame.

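library(janitor)  # row_to_names()
library(stringr)  # str_to_title()

# HTML: convert the html_table() result to a data.frame, then promote row 1 to column names
df_html <- as.data.frame(html_table(import_html)) %>%
  row_to_names(1) %>%
  tibble::remove_rownames()

# XML: convert the parsed document, then capitalize the column names
df_xml <- xmlToDataFrame(import_xml)
colnames(df_xml) <- str_to_title(colnames(df_xml))

# JSON: turn each list element into a data.frame, stack them, and reset the row names
df_json <- do.call("rbind", lapply(import_json, data.frame))
rownames(df_json) <- NULL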
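
To make the three conversion paths easier to follow, here is a small self-contained sketch that runs them on inline data rather than on the files from GitHub. The strings and the demo_* names are hypothetical, the inline HTML uses th header cells (so row_to_names() is not needed here), and the JSON layout (one "books" array) is an assumption about the source file.

```r
library(rvest)     # read_html(), html_table()
library(XML)       # xmlParse(), xmlToDataFrame()
library(jsonlite)  # fromJSON()

# Hypothetical inline stand-ins for the three source files
html_txt <- paste0(
  "<table><tr><th>Title</th><th>Year</th></tr>",
  "<tr><td>Bayesian Theory</td><td>2000</td></tr>",
  "<tr><td>The Bayesian Choice</td><td>2007</td></tr></table>")
xml_txt <- paste0(
  "<books>",
  "<book><title>Bayesian Theory</title><year>2000</year></book>",
  "<book><title>The Bayesian Choice</title><year>2007</year></book>",
  "</books>")
json_txt <- paste0(
  '{"books":[{"Title":"Bayesian Theory","Year":2000},',
  '{"Title":"The Bayesian Choice","Year":2007}]}')

# HTML: html_table() on a whole document returns a list with one entry per <table>
demo_tables <- html_table(read_html(html_txt))
demo_html   <- as.data.frame(demo_tables[[1]])

# XML: each child of the root node becomes one row
demo_xml <- xmlToDataFrame(xmlParse(xml_txt, asText = TRUE))

# JSON: lapply() converts each list element to a data.frame, do.call() stacks them with rbind()
demo_json <- do.call("rbind", lapply(fromJSON(json_txt), data.frame))
rownames(demo_json) <- NULL  # discard the row names that rbind() generated

str(demo_html)
str(demo_xml)
str(demo_json)
```

Comparing the str() output of the three results is a quick way to see where column names and column types diverge between the formats.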

HTML
Book Title	Author(s)	Year Published	Publisher	Price ($)
Bayesian Theory 1st Edition	Jose M. Bernardo, Adrian F. M. Smith	2000	Wiley Series in Probability and Statistics	98.03
The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation 2nd Edition	Christian P. Robert	2007	Springer Verlag	43.99
A First Course in Bayesian Statistical Methods 1st Edition	Peter D. Hoff	2010	Springer Verlag	46.60

XML
Title	Author	Year_published	Publisher	Price
Bayesian Theory 1st Edition	Jose M. Bernardo, Adrian F. M. Smith	2000	Wiley Series in Probability and Statistics	98.03
The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation 2nd Edition	Christian P. Robert	2007	Springer Verlag	43.99
A First Course in Bayesian Statistical Methods 1st Edition	Peter D. Hoff	2010	Springer Verlag	46.60

JSON
BookName	Author	YearPublished	Publisher	Price
Bayesian Theory 1st Edition	Jose M. Bernardo, Adrian F. M. Smith	2000	Wiley Series in Probability and Statistics	98.03
The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation 2nd Edition	Christian P. Robert	2007	Springer Verlag	43.99
A First Course in Bayesian Statistical Methods 1st Edition	Peter D. Hoff	2010	Springer Verlag	46.60

Conclusion

Are the three data frames identical?

No, they are not. The column names are imported according to the naming conventions of each source file (although they can be excluded or altered on import). Each format requires its own library to import, and the class of the imported object is distinct in each case. As a result, the approach to converting the data into a data.frame also differs.
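
One way to test this claim directly is with identical() and all.equal() from base R. The sketch below is illustrative only: df_a and df_b are hypothetical two-row stand-ins for two of the imported data frames, and storing Price as text in one and as a number in the other is an assumption made to show a type mismatch, not a statement about the actual files.

```r
# Hypothetical stand-ins: same books, different column names and column types
df_a <- data.frame(Title = c("Bayesian Theory", "The Bayesian Choice"),
                   Price = c("98.03", "43.99"),
                   stringsAsFactors = FALSE)
df_b <- data.frame(BookName = c("Bayesian Theory", "The Bayesian Choice"),
                   Price = c(98.03, 43.99),
                   stringsAsFactors = FALSE)

# Strict comparison of the data frames exactly as they were built
same_as_imported <- identical(df_a, df_b)

# Align the column types and names, then compare again
df_a_aligned <- df_a
df_a_aligned$Price <- as.numeric(df_a_aligned$Price)
df_b_aligned <- df_b
names(df_b_aligned) <- names(df_a_aligned)

same_after_alignment <- isTRUE(all.equal(df_a_aligned, df_b_aligned))

same_as_imported
same_after_alignment
```

The same two checks could be applied to df_html, df_xml, and df_json once their column names and types have been harmonized.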

¹ When examined with the class() function, the result of html_table(import_html) is of class list.
² The row_to_names() function is used to replace the column names with the first row's values.
³ str_to_title() is used to capitalize the first letter of each column name.

DATA607 Week 7 Assignment

Gabriel Campos

March 21 2021
