Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.
Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
Three books with six attributes are selected for this assignment. A brief view of data is as below:
Title | Author | Publisher | Year | Edition | ISBN |
---|---|---|---|---|---|
Automated Data Collection with R | Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis | John Wiley & Sons, Ltd | 2015 | 1st | 978-1-118-83481-7 |
Data Science for Business | Foster Provost, Tom Fawcett | O’Reilly Media, Inc | 2013 | 1st | 978-1-449-36132-7 |
Bayesian Data Analysis | Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin | CRC Press Taylor & Francis Group | 2013 | 3rd | 978-1-4398-9820-8 |
The following packages are used in this assignment
library(tidyverse)
library(XML)
library(rvest)
library(RCurl)
library(jsonlite)
A HTML file is created with <title> ‘Three Books’ and a table under <body>. The source code is as below:
The code is interpreted by browser as below:
The following actions are performed to load the HTML table into R as dataframe:
getURL
function to extract the link of the html file.read_html
function.html_table
function to extract a list of tables if any from the html file and convert the tables into dataframes.url <- getURL('https://raw.githubusercontent.com/oggyluky11/DATA607-Assignment-7/master/Three%20Books.htm')
df_HTML <- url %>%
read_html(encoding = 'UTF-8') %>%
html_table(header = NA, trim = TRUE) %>%
.[[1]]
df_HTML
str(df_HTML)
## 'data.frame': 3 obs. of 6 variables:
## $ Title : chr "Automated Data Collection with R" "Data Science for Business" "Bayesian Data Analysis"
## $ Author : chr "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis" "Foster Provost, Tom Fawcett" "Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin"
## $ Publisher: chr "John Wiley & Sons, Ltd" "O’Reilly Media, Inc" "CRC Press Taylor & Francis Group"
## $ Year : int 2015 2013 2013
## $ Edition : chr "1st" "1st" "3rd"
## $ ISBN : chr "978-1-118-83481-7" "978-1-449-36132-7" "978-1-4398-9820-8"
A XML file is created. All values are stored in elements except authors are sotred in attributes within tag <Author>. The source code is as below:
Firstly parse values in all elements into R dataframe.
df_XML
using xmlParse
function.xmlRoot
function.xmlToDataFrame
url <- getURL('https://raw.githubusercontent.com/oggyluky11/DATA607-Assignment-7/master/Three%20Books.xml')
df_XML <- url %>%
xmlParse() %>%
xmlRoot() %>%
xmlToDataFrame(stringsAsFactors = FALSE)
df_XML
The column Author
is empty because the element <Author> itself is empty except attributes are defined.
Therefore the attributes in <Author> is extracted seperately.
read_XML
.xml_nodes
.xml_attrs
.attr <- url %>%
read_xml() %>%
xml_nodes(xpath = '//Author') %>%
xml_attrs() %>%
lapply(function(x) str_c(x,collapse = ', ')) %>%
unlist()
attr
## [1] "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis"
## [2] "Foster Provost, Tom Fawcett"
## [3] "Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin"
Author
in df_XML
with values from the list attr
.df_XML <- df_XML %>% mutate(Author = attr)
df_XML
str(df_XML)
## 'data.frame': 3 obs. of 6 variables:
## $ Title : chr "Automated Data Collection with R" "Data Science for Business" "Bayesian Data Analysis"
## $ Author : chr "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis" "Foster Provost, Tom Fawcett" "Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin"
## $ Publisher: chr "John Wiley & Sons, Ltd" "O’Reilly Media, Inc" "CRC Press Taylor & Francis Group"
## $ Year : chr "2015" "2013" "2013"
## $ Edition : chr "1st" "1st" "3rd"
## $ ISBN : chr "978-1-118-83481-7" "978-1-449-36132-7" "978-1-4398-9820-8"
A JOSN file is created. The source code is as below:
df_JSON
df_JSON
to dataframeAuthor
contains lists of names of authors of each book, therefore concatename those names in each list as one single string in order to fit into column ‘Author’.url <- getURL("https://raw.githubusercontent.com/oggyluky11/DATA607-Assignment-7/master/Three%20Books.json")
df_JSON <- url %>%
fromJSON() %>%
as.data.frame() %>%
rename_all(funs(str_replace(., 'three\\.books\\.',''))) %>%
mutate(Author = unlist(lapply(Author, function(x) str_c(x, collapse = ', '))))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
df_JSON
str(df_JSON)
## 'data.frame': 3 obs. of 6 variables:
## $ Title : chr "Automated Data Collection with R" "Data Science for Business" "Bayesian Data Analysis"
## $ Author : chr "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis" "Foster Provost, Tom Fawcett" "Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin"
## $ Publisher: chr "John Wiley & Sons, Ltd" "O’Reilly Media, Inc" "CRC Press Taylor & Francis Group"
## $ Year : int 2015 2013 2013
## $ Edition : chr "1st" "1st" "3rd"
## $ ISBN : chr "978-1-118-83481-7" "978-1-449-36132-7" "978-1-4398-9820-8"
These three dataframes are almost identical. There are only minor differences while parsing numeric values from source file to R dataframe.
df_HTML
Vs df_XML
The only difference is the data type of column Year
. The html_table
function from package rvest
automatically parses numbers as numerical values, while the xmlToDataFrame
function from package XML
converts all values to characters if argument stringsAsFactors
= FALSE. The two dataframes are identical if the data type of column Year
in df_XML
is converted to integer.
all.equal(df_HTML,df_XML)
## [1] "Component \"Year\": Modes: numeric, character"
## [2] "Component \"Year\": target is numeric, current is character"
all.equal(df_HTML$Year, as.integer(df_XML$Year))
## [1] TRUE
df_HTML
Vs df_JSON
The two dataframes are identical. both html_table
function from package rvest
and fromJSON
from package jsonlite
are robust enough to parse values in appropriate data types.
all.equal(df_HTML,df_JSON)
## [1] TRUE
df_XML
Vs df_JSON
The two dataframes are almost identical. The xmlToDataFrame
function from package XML
converts data type of numbers as characters if argument stringsAsFactors
= FALSE, while the fromJSON
from package jsonlite
parses numbers as numerical values. The two dataframes are identical if the data type of column Year
in df_XML
is converted to integer.
all.equal(df_XML,df_JSON)
## [1] "Component \"Year\": Modes: character, numeric"
## [2] "Component \"Year\": target is character, current is numeric"
all.equal(as.integer(df_XML$Year), df_JSON$Year)
## [1] TRUE