Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.
Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
Three books with six attributes each are selected for this assignment. The data are summarized below:
| Title | Author | Publisher | Year | Edition | ISBN |
|---|---|---|---|---|---|
| Automated Data Collection with R | Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis | John Wiley & Sons, Ltd | 2015 | 1st | 978-1-118-83481-7 |
| Data Science for Business | Foster Provost, Tom Fawcett | O’Reilly Media, Inc | 2013 | 1st | 978-1-449-36132-7 |
| Bayesian Data Analysis | Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin | CRC Press Taylor & Francis Group | 2013 | 3rd | 978-1-4398-9820-8 |
The following packages are used in this assignment:
library(tidyverse)
library(XML)
library(rvest)
library(RCurl)
library(jsonlite)
An HTML file is created with the <title> ‘Three Books’ and a table under <body>.
[Figure: source code of the HTML file]
[Figure: the HTML file as rendered by a browser]
The following steps load the HTML table into R as a data frame:
- Use the getURL function to download the content of the HTML file.
- Parse the content with the read_html function.
- Use the html_table function to extract a list of tables, if any, from the HTML file and convert the tables into data frames.

url <- getURL('https://raw.githubusercontent.com/oggyluky11/DATA607-Assignment-7/master/Three%20Books.htm')
df_HTML <- url %>%
read_html(encoding = 'UTF-8') %>%
html_table(header = NA, trim = TRUE) %>%
.[[1]]
df_HTML
str(df_HTML)
## 'data.frame': 3 obs. of 6 variables:
## $ Title : chr "Automated Data Collection with R" "Data Science for Business" "Bayesian Data Analysis"
## $ Author : chr "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis" "Foster Provost, Tom Fawcett" "Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin"
## $ Publisher: chr "John Wiley & Sons, Ltd" "O’Reilly Media, Inc" "CRC Press Taylor & Francis Group"
## $ Year : int 2015 2013 2013
## $ Edition : chr "1st" "1st" "3rd"
## $ ISBN : chr "978-1-118-83481-7" "978-1-449-36132-7" "978-1-4398-9820-8"
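The same read_html and html_table steps can be tried without network access. The following is a minimal sketch, assuming a small inline table with made-up structure (only two columns, and the name `html_src` is hypothetical); it is not the file used in this report.

```r
library(rvest)

# Hypothetical two-row table standing in for the real file hosted on GitHub.
html_src <- '
<html><head><title>Three Books</title></head>
<body>
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Data Science for Business</td><td>2013</td></tr>
  <tr><td>Automated Data Collection with R</td><td>2015</td></tr>
</table>
</body></html>'

# read_html() accepts a literal HTML string as well as a URL or file path.
books_html <- html_src %>%
  read_html() %>%
  html_table() %>%   # returns a list with one entry per <table>
  .[[1]]             # keep the first (and only) table

# Year is converted to integer because html_table() guesses column types.
str(books_html)
```

Because html_table always returns a list, the `.[[1]]` step is needed even when the page contains a single table.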
An XML file is created. All values are stored in elements, except the authors, which are stored as attributes of the <Author> tag.
[Figure: source code of the XML file]
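First, parse the values of all elements into an R data frame:
- Parse the XML content using the xmlParse function.
- Get the root node using the xmlRoot function.
- Convert the records into the data frame df_XML using xmlToDataFrame.

url <- getURL('https://raw.githubusercontent.com/oggyluky11/DATA607-Assignment-7/master/Three%20Books.xml')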
df_XML <- url %>%
xmlParse() %>%
xmlRoot() %>%
xmlToDataFrame(stringsAsFactors = FALSE)
df_XML
The column Author is empty because the <Author> element has no text content; it only carries attributes.
Therefore, the attributes of <Author> are extracted separately:
- Read the XML content with read_xml.
- Select the <Author> nodes with xml_nodes.
- Extract their attributes with xml_attrs.

attr <- url %>%
read_xml() %>%
xml_nodes(xpath = '//Author') %>%
xml_attrs() %>%
lapply(function(x) str_c(x,collapse = ', ')) %>%
unlist()
attr
## [1] "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis"
## [2] "Foster Provost, Tom Fawcett"
## [3] "Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin"
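The attribute extraction above can also be sketched with the xml2 package alone, using xml_find_all in place of xml_nodes. This is an illustration under assumptions: the attribute names `author1` and `author2`, the element layout, and the second book are hypothetical, since the actual XML source is not reproduced here.

```r
library(xml2)

# Hypothetical layout: author names stored only as attributes of <Author>.
xml_src <- '
<Books>
  <Book>
    <Title>Data Science for Business</Title>
    <Author author1="Foster Provost" author2="Tom Fawcett"/>
  </Book>
  <Book>
    <Title>Hypothetical Single-Author Book</Title>
    <Author author1="Jane Doe"/>
  </Book>
</Books>'

doc <- read_xml(xml_src)
author_nodes <- xml_find_all(doc, "//Author")

# The element text is empty, which is why a text-based conversion
# leaves the Author column blank.
xml_text(author_nodes)

# xml_attrs() returns one named character vector per node;
# collapse each vector into a single comma-separated string.
authors <- vapply(xml_attrs(author_nodes), paste, character(1), collapse = ", ")
authors
```

Using vapply with `character(1)` guarantees one string per book, so the result can be assigned directly to a data frame column.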
Replace the column Author in df_XML with the values from the vector attr.
df_XML <- df_XML %>% mutate(Author = attr)
df_XML
str(df_XML)
## 'data.frame': 3 obs. of 6 variables:
## $ Title : chr "Automated Data Collection with R" "Data Science for Business" "Bayesian Data Analysis"
## $ Author : chr "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis" "Foster Provost, Tom Fawcett" "Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin"
## $ Publisher: chr "John Wiley & Sons, Ltd" "O’Reilly Media, Inc" "CRC Press Taylor & Francis Group"
## $ Year : chr "2015" "2013" "2013"
## $ Edition : chr "1st" "1st" "3rd"
## $ ISBN : chr "978-1-118-83481-7" "978-1-449-36132-7" "978-1-4398-9820-8"
A JSON file is created.
[Figure: source code of the JSON file]
- Parse the JSON content with fromJSON and store the result in df_JSON.
- Convert df_JSON to a data frame and strip the three.books. prefix from the column names.
- Author contains a list of author names for each book, so the names in each list are concatenated into a single string to fit into the column ‘Author’.

url <- getURL("https://raw.githubusercontent.com/oggyluky11/DATA607-Assignment-7/master/Three%20Books.json")
df_JSON <- url %>%
fromJSON() %>%
as.data.frame() %>%
rename_all(~ str_replace(., 'three\\.books\\.', '')) %>%
mutate(Author = unlist(lapply(Author, function(x) str_c(x, collapse = ', '))))
df_JSON
str(df_JSON)
## 'data.frame': 3 obs. of 6 variables:
## $ Title : chr "Automated Data Collection with R" "Data Science for Business" "Bayesian Data Analysis"
## $ Author : chr "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis" "Foster Provost, Tom Fawcett" "Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin"
## $ Publisher: chr "John Wiley & Sons, Ltd" "O’Reilly Media, Inc" "CRC Press Taylor & Francis Group"
## $ Year : int 2015 2013 2013
## $ Edition : chr "1st" "1st" "3rd"
## $ ISBN : chr "978-1-118-83481-7" "978-1-449-36132-7" "978-1-4398-9820-8"
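The list-column handling for Author can be reproduced on a small inline document. The sketch below assumes the top-level key is named "three books" (inferred from the three.books. prefix removed above) and uses a hypothetical second record; it indexes the parsed list directly instead of calling as.data.frame, which is a different route from the one taken in this report.

```r
library(jsonlite)

# Hypothetical JSON with a JSON array of authors per book.
json_src <- '{"three books": [
  {"Title": "Data Science for Business",
   "Author": ["Foster Provost", "Tom Fawcett"],
   "Year": 2013},
  {"Title": "Hypothetical Single-Author Book",
   "Author": ["Jane Doe"],
   "Year": 2015}
]}'

# fromJSON() simplifies the array of records into a data frame;
# Author arrives as a list column because the arrays differ in length.
books_json <- fromJSON(json_src)[["three books"]]

# Collapse each book's authors into one comma-separated string.
books_json$Author <- vapply(books_json$Author, paste, character(1),
                            collapse = ", ")

str(books_json)
```

Indexing by the key name avoids the prefixed column names, so no renaming step is needed in this variant.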
The three data frames are almost identical. The only differences arise from how each parser handles numeric values when loading the source file into an R data frame.
df_HTML vs df_XML
The only difference is the data type of the column Year. The html_table function from the rvest package automatically converts numbers to numeric values, while the xmlToDataFrame function from the XML package reads all values as text, which stay character when stringsAsFactors = FALSE. The two data frames are identical once the column Year in df_XML is converted to integer.
all.equal(df_HTML,df_XML)
## [1] "Component \"Year\": Modes: numeric, character"
## [2] "Component \"Year\": target is numeric, current is character"
all.equal(df_HTML$Year, as.integer(df_XML$Year))
## [1] TRUE
df_HTML vs df_JSON
The two data frames are identical. Both the html_table function from the rvest package and fromJSON from the jsonlite package parse the values into appropriate data types.
all.equal(df_HTML,df_JSON)
## [1] TRUE
df_XML vs df_JSON
The two data frames are almost identical. The xmlToDataFrame function from the XML package reads numbers as text, which stay character when stringsAsFactors = FALSE, while fromJSON from the jsonlite package parses numbers as numeric values. The two data frames are identical once the column Year in df_XML is converted to integer.
all.equal(df_XML,df_JSON)
## [1] "Component \"Year\": Modes: character, numeric"
## [2] "Component \"Year\": target is character, current is numeric"
all.equal(as.integer(df_XML$Year), df_JSON$Year)
## [1] TRUE
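The type mismatch in the Year column can be illustrated with two small data frames built by hand. This is a sketch with hypothetical names (`df_int`, `df_chr`, `df_fixed`) and toy values, using base R only; it converts the whole column once rather than comparing columns one at a time.

```r
# Two toy data frames that differ only in the type of Year.
df_int <- data.frame(Title = c("A", "B"), Year = c(2015L, 2013L),
                     stringsAsFactors = FALSE)
df_chr <- data.frame(Title = c("A", "B"), Year = c("2015", "2013"),
                     stringsAsFactors = FALSE)

# all.equal() returns a description of the differences rather than FALSE,
# so wrap it in isTRUE() when a logical result is needed.
isTRUE(all.equal(df_int, df_chr))

# Convert the character column to integer, then compare again.
df_fixed <- df_chr
df_fixed$Year <- as.integer(df_fixed$Year)

isTRUE(all.equal(df_int, df_fixed))
```

Converting the column in the data frame itself means the whole-frame comparison can be reused for every pair of sources.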