Task

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

#load package
library(XML)
library(DT)
library(RJSONIO)
library(tidyverse)

Working with the HTML file

We built the HTML file by hand and uploaded it to GitHub. The code can be accessed on GitHub, and the resulting HTML table is shown below:

Books on R

Book                     Author                             Subject             ISBN 13         Price $
R for Data Science       Hadley Wickham, Garrett Grolemund  Basic R             978-1491910399  38.20
Art of R Programming     Norman Matloff                     R Programming       978-1593273842  20.16
Machine Learning with R  Brett Lantz                        R Machine Learning  978-1784393908  44.93

We then downloaded the file from GitHub, parsed it with htmlParse, and used readHTMLTable to read the first table into a data frame.

#read path
path<-"https://raw.githubusercontent.com/zahirf/Data607/master/books.html"
#download
download.file(path, destfile = "C:/Users/zahir/Documents/Data 607/Week 07/books.html")
#parse
parsed<-htmlParse(file = "C:/Users/zahir/Documents/Data 607/Week 07/books.html")
#read into table
html_table<-readHTMLTable(parsed, which=1, stringsAsFactors = FALSE)
#view
datatable(html_table)
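
To see what readHTMLTable does without a download, the same two calls can be run on a small HTML string. This is a minimal sketch rather than the file used above: the names html_text, doc_demo, and html_demo and the two-row table are made up for illustration, and asText = TRUE tells htmlParse to treat its argument as markup instead of a file path.

```r
library(XML)

# Hypothetical two-column table; the <th> row supplies the column names.
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>Book</th><th>Price</th></tr>",
  "<tr><td>R for Data Science</td><td>38.20</td></tr>",
  "<tr><td>Art of R Programming</td><td>20.16</td></tr>",
  "</table></body></html>"
)

# Parse the string itself instead of a file on disk.
doc_demo <- htmlParse(html_text, asText = TRUE)

# which = 1 returns the first table as a single data frame.
html_demo <- readHTMLTable(doc_demo, which = 1, stringsAsFactors = FALSE)
str(html_demo)
```

Every column comes back as character, so a numeric field such as the price would still need as.numeric before any arithmetic.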

Working with XML

The XML code to build the file can be accessed on GitHub and is shown below:


<book id="1">
    <Title>R for Data Science</Title>
    <Author1>Hadley Wickham,</Author1> 
    <Author2>Garrett Grolemund</Author2>
    <Subject>Basic R</Subject>
    <ISBN>978-1491910399</ISBN>
    <Price>38.20</Price>  
</book>
<book id="2">
    <Title>Art of R Programming</Title>
    <Author1>Norman Matloff</Author1>
    <Subject>R Programming</Subject>
    <ISBN>978-1593273842</ISBN>
    <Price>20.16</Price> 
</book>
<book id="3">
    <Title>Machine Learning with R</Title>
    <Author1>Brett Lantz</Author1>
    <Subject>R Machine Learning</Subject>
    <ISBN>978-1784393908</ISBN>
    <Price>44.93</Price> 
</book>


We followed the same steps as with the HTML file, but this time used xmlParse to parse the file and xmlToDataFrame to read it into a data frame.

#path
path<-"https://raw.githubusercontent.com/zahirf/Data607/master/books.xml"
#download
download.file(path, destfile = "C:/Users/zahir/Documents/Data 607/Week 07/books.xml")
#parse
xml_parsed<-xmlParse(file = "C:/Users/zahir/Documents/Data 607/Week 07/books.xml")
#read into table
xml_table<-xmlToDataFrame(xml_parsed, stringsAsFactors = FALSE)
#view
datatable(xml_table)
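
The way xmlToDataFrame treats a missing element can be checked on a small inline document. This is a sketch under stated assumptions: the root element name books and the names xml_text, xml_doc_demo, and xml_demo are hypothetical (the root of the real file is not shown above), and only three of the fields are included.

```r
library(XML)

# Hypothetical two-book document; the second book has no <Author2> element.
xml_text <- paste0(
  "<books>",
  "<book id='1'><Title>R for Data Science</Title>",
  "<Author1>Hadley Wickham,</Author1><Author2>Garrett Grolemund</Author2></book>",
  "<book id='2'><Title>Art of R Programming</Title>",
  "<Author1>Norman Matloff</Author1></book>",
  "</books>"
)

# Parse the string itself instead of a file on disk.
xml_doc_demo <- xmlParse(xml_text, asText = TRUE)

# Each <book> becomes a row and each child element becomes a column.
xml_demo <- xmlToDataFrame(xml_doc_demo, stringsAsFactors = FALSE)
xml_demo
```

A book without a second author gets NA in the Author2 column, which is the pattern that appears for the single-author books in xml_table.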

Working with JSON

The JSON used to build the file is shown below:

{"booksonR": [
  {"Title": "R for Data Science", "Author1": "Hadley Wickham,", "Author2": "Garrett Grolemund", "Subject": "Basic R", "ISBN": "978-1491910399", "Price": "38.20"},
  {"Title": "Art of R Programming", "Author1": "Norman Matloff", "Author2": "", "Subject": "R Programming", "ISBN": "978-1593273842", "Price": "20.16"},
  {"Title": "Machine Learning with R", "Author1": "Brett Lantz", "Author2": "", "Subject": "R Machine Learning", "ISBN": "978-1784393908", "Price": "44.93"}
]}

We used fromJSON to parse the file and do.call with rbind to bind the result into a data frame.

#path
path<-"https://raw.githubusercontent.com/zahirf/Data607/master/books.json"
#download
download.file(path, destfile = "C:/Users/zahir/Documents/Data 607/Week 07/books.json")
#parse
json_parsed<-fromJSON(content = "C:/Users/zahir/Documents/Data 607/Week 07/books.json")
#read into table
json<-sapply(json_parsed[[1]], unlist)
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:purrr':
## 
##     compact
json_table<-do.call("rbind", lapply(lapply(json[[1]],t), data.frame, stringsAsFactors = FALSE))
#view
datatable(json_table)
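
One way to get one row per book from a list of records is to convert each record to a one-row data frame before binding. The sketch below uses only base R and a hand-built stand-in for the parsed list: records_demo, rows_demo, and json_demo are hypothetical names, only four of the six fields are included, and it assumes each book arrives as a named character vector, which may not match what fromJSON returns for every file.

```r
# Hypothetical stand-in for the parsed "booksonR" array: one named
# character vector per book, with values taken from books.json.
records_demo <- list(
  c(Title = "R for Data Science", Author1 = "Hadley Wickham,",
    Author2 = "Garrett Grolemund", Price = "38.20"),
  c(Title = "Art of R Programming", Author1 = "Norman Matloff",
    Author2 = "", Price = "20.16"),
  c(Title = "Machine Learning with R", Author1 = "Brett Lantz",
    Author2 = "", Price = "44.93")
)

# Convert each record to a one-row data frame, keeping fields as columns.
rows_demo <- lapply(records_demo, function(rec) {
  as.data.frame(as.list(rec), stringsAsFactors = FALSE)
})

# Stack the one-row data frames so each book becomes one observation.
json_demo <- do.call(rbind, rows_demo)

# Treat empty strings as missing, matching the NA values in the XML result.
json_demo[json_demo == ""] <- NA
dim(json_demo)
```

The design choice here is to keep the field names as column names by going through as.list, instead of transposing unnamed values.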

Comparing the three data frames

glimpse(html_table)
## Observations: 3
## Variables: 5
## $ Book      <chr> "R for Data Science", "Art of R Programming", "Machi...
## $ Author    <chr> "Hadley Wickham,\r\r\n      Garrett Grolemund", "Nor...
## $ Subject   <chr> "Basic R", "R Programming", "R Machine Learning"
## $ `ISBN 13` <chr> "978-1491910399", "978-1593273842", "978-1784393908"
## $ `Price $` <chr> "38.20", "20.16", "44.93"
glimpse(xml_table)
## Observations: 3
## Variables: 6
## $ Title   <chr> "R for Data Science", "Art of R Programming", "Machine...
## $ Author1 <chr> "Hadley Wickham,", "Norman Matloff", "Brett Lantz"
## $ Author2 <chr> "Garrett Grolemund", NA, NA
## $ Subject <chr> "Basic R", "R Programming", "R Machine Learning"
## $ ISBN    <chr> "978-1491910399", "978-1593273842", "978-1784393908"
## $ Price   <chr> "38.20", "20.16", "44.93"
glimpse(json_table)
## Observations: 6
## Variables: 1
## $ X..i.. <chr> "R for Data Science", "Hadley Wickham,", "Garrett Grole...

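The HTML and XML data frames are not identical. Both have 3 observations, but the HTML data frame has 5 variables and the XML data frame has 6. The difference comes from how the two files are written: HTML uses a fixed set of predefined tags, whereas XML lets users define their own tag names. The XML file stores the authors in separate Author1 and Author2 elements, so they arrive as two columns, while the HTML table holds both author names in a single Author cell (along with the line break and indentation from the source file). The column names also differ (Book, ISBN 13, and Price $ in HTML versus Title, ISBN, and Price in XML), and every column in both data frames is read as character.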
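
The comparison can also be made explicit with identical. The sketch below uses base R and two-book excerpts typed in by hand from the glimpse output, so html_like, xml_like, and xml_aligned are hypothetical stand-ins, not the real data frames. It shows one possible way to align the author columns and names before comparing.

```r
# Hypothetical two-book excerpts of the HTML and XML results shown above.
html_like <- data.frame(
  Book = c("R for Data Science", "Art of R Programming"),
  Author = c("Hadley Wickham,\r\r\n      Garrett Grolemund", "Norman Matloff"),
  stringsAsFactors = FALSE
)
xml_like <- data.frame(
  Title = c("R for Data Science", "Art of R Programming"),
  Author1 = c("Hadley Wickham,", "Norman Matloff"),
  Author2 = c("Garrett Grolemund", NA),
  stringsAsFactors = FALSE
)

# Different column names and column counts, so this comparison fails.
identical(html_like, xml_like)

# Collapse the line break and indentation inside the HTML author cell.
html_like$Author <- gsub("\\s+", " ", html_like$Author)

# Join the two XML author columns, skipping a missing second author.
author <- ifelse(is.na(xml_like$Author2),
                 xml_like$Author1,
                 paste(xml_like$Author1, xml_like$Author2))
xml_aligned <- data.frame(Book = xml_like$Title, Author = author,
                          stringsAsFactors = FALSE)

# After aligning names and author format, the two excerpts match.
identical(html_like, xml_aligned)
```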

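JSON uses JavaScript object notation (key-value pairs) rather than tags. fromJSON imported the data as a list, which we unlisted so it could be transformed into a data frame. The result does not match the other two data frames: glimpse reports 6 observations of 1 variable for json_table, with field values stored as rows instead of one row per book.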