Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.
Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
#load package
library(XML)
library(DT)
library(RJSONIO)
library(tidyverse)
We built the HTML file and uploaded it to GitHub. The code can be accessed from GitHub, and the resulting HTML table is shown below:
| Book | Author | Subject | ISBN 13 | Price $ |
|---|---|---|---|---|
| R for Data Science | Hadley Wickham, Garrett Grolemund | Basic R | 978-1491910399 | 38.20 |
| Art of R Programming | Norman Matloff | R Programming | 978-1593273842 | 20.16 |
| Machine Learning with R | Brett Lantz | R Machine Learning | 978-1784393908 | 44.93 |
We then downloaded the file from GitHub, parsed it with htmlParse, and used readHTMLTable to read the first table into a data frame.
#read path
path<-"https://raw.githubusercontent.com/zahirf/Data607/master/books.html"
#download
download.file(path, destfile = "C:/Users/zahir/Documents/Data 607/Week 07/books.html")
#parse
parsed<-htmlParse(file = "C:/Users/zahir/Documents/Data 607/Week 07/books.html")
#read into table
html_table<-readHTMLTable(parsed, which=1, stringsAsFactors = FALSE)
#view
datatable(html_table)
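The download step above writes to a hard-coded Windows path, so it only runs on one machine. Here is a minimal sketch of a more portable variant, assuming a temporary file is acceptable. The small inline table is a made-up, abbreviated stand-in for books.html so the example runs offline, and the name html_demo is ours; in the real workflow, download.file(path, destfile = tmp) would fill the temporary file instead of writeLines.

```r
library(XML)

# Hypothetical stand-in for books.html (abbreviated to three columns)
html_lines <- c(
  "<!DOCTYPE html>",
  "<html><body><table>",
  "<tr><th>Book</th><th>Author</th><th>Price $</th></tr>",
  "<tr><td>Art of R Programming</td><td>Norman Matloff</td><td>20.16</td></tr>",
  "<tr><td>Machine Learning with R</td><td>Brett Lantz</td><td>44.93</td></tr>",
  "</table></body></html>"
)

# tempfile() returns a writable path on any operating system
tmp <- tempfile(fileext = ".html")
writeLines(html_lines, tmp)  # real workflow: download.file(path, destfile = tmp)

# Same parse-then-read steps as above, without a machine-specific path
parsed <- htmlParse(file = tmp)
html_demo <- readHTMLTable(parsed, which = 1, stringsAsFactors = FALSE)
unlink(tmp)  # remove the temporary file once it has been parsed

str(html_demo)
```

The same pattern would apply to the XML and JSON downloads by changing the file extension.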
The XML code to build the file can be accessed on GitHub and is shown below:
<book id="1">
<Title>R for Data Science</Title>
<Author1>Hadley Wickham,</Author1>
<Author2>Garrett Grolemund</Author2>
<Subject>Basic R</Subject>
<ISBN>978-1491910399</ISBN>
<Price>38.20</Price>
</book>
<book id="2">
<Title>Art of R Programming</Title>
<Author1>Norman Matloff</Author1>
<Subject>R Programming</Subject>
<ISBN>978-1593273842</ISBN>
<Price>20.16</Price>
</book>
<book id="3">
<Title>Machine Learning with R</Title>
<Author1>Brett Lantz</Author1>
<Subject>R Machine Learning</Subject>
<ISBN>978-1784393908</ISBN>
<Price>44.93</Price>
</book>
We followed the same steps as with the HTML file, but this time used xmlParse to parse the file and xmlToDataFrame to read it into a data frame.
#path
path<-"https://raw.githubusercontent.com/zahirf/Data607/master/books.xml"
#download
download.file(path, destfile = "C:/Users/zahir/Documents/Data 607/Week 07/books.xml")
#parse
xml_parsed<-xmlParse(file = "C:/Users/zahir/Documents/Data 607/Week 07/books.xml")
#read into table
xml_table<-xmlToDataFrame(xml_parsed, stringsAsFactors = FALSE)
#view
datatable(xml_table)
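To illustrate how xmlToDataFrame treats a book that has no second author, here is a small offline sketch. It assumes a root element named books, which is hypothetical (the root tag of the original file is not shown above), and it uses an abbreviated set of fields. It also converts Price to numeric, since every column arrives as character.

```r
library(XML)

# Abbreviated stand-in for books.xml; the <books> root name is an assumption
xml_text <- '
<books>
  <book id="1">
    <Title>R for Data Science</Title>
    <Author1>Hadley Wickham,</Author1>
    <Author2>Garrett Grolemund</Author2>
    <Price>38.20</Price>
  </book>
  <book id="2">
    <Title>Art of R Programming</Title>
    <Author1>Norman Matloff</Author1>
    <Price>20.16</Price>
  </book>
</books>'

# asText = TRUE tells xmlParse that the argument is XML content, not a file name
xml_doc <- xmlParse(xml_text, asText = TRUE)

# Each <book> becomes a row; a missing <Author2> is filled with NA
xml_demo <- xmlToDataFrame(xml_doc, stringsAsFactors = FALSE)

# Columns are read as character, so numeric fields need an explicit conversion
xml_demo$Price <- as.numeric(xml_demo$Price)

str(xml_demo)
```

This matches what the glimpse output later in the document shows for the full file, where Author2 is NA for the single-author books.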
The JSON code to build the file is shown below:
{"booksonR" :[
{ "Title" : "R for Data Science", "Author1" : "Hadley Wickham,", "Author2" : "Garrett Grolemund", "Subject" : "Basic R", "ISBN" : "978-1491910399", "Price" : "38.20" },
{ "Title" : "Art of R Programming", "Author1" : "Norman Matloff", "Author2" : "", "Subject" : "R Programming", "ISBN" : "978-1593273842", "Price" : "20.16" },
{ "Title" : "Machine Learning with R", "Author1" : "Brett Lantz", "Author2" : "", "Subject" : "R Machine Learning", "ISBN" : "978-1784393908", "Price" : "44.93" },
] }
We used fromJSON to parse the file and do.call with rbind to bind the records into a data frame.
#path
path<-"https://raw.githubusercontent.com/zahirf/Data607/master/books.json"
#download
download.file(path, destfile = "C:/Users/zahir/Documents/Data 607/Week 07/books.json")
#parse
json_parsed<-fromJSON(content = "C:/Users/zahir/Documents/Data 607/Week 07/books.json")
#read into table
json<-sapply(json_parsed[[1]], unlist)
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
json_table<-do.call("rbind", lapply(lapply(json[[1]],t), data.frame, stringsAsFactors = FALSE))
#view
datatable(json_table)
glimpse(html_table)
## Observations: 3
## Variables: 5
## $ Book <chr> "R for Data Science", "Art of R Programming", "Machi...
## $ Author <chr> "Hadley Wickham,\r\r\n Garrett Grolemund", "Nor...
## $ Subject <chr> "Basic R", "R Programming", "R Machine Learning"
## $ `ISBN 13` <chr> "978-1491910399", "978-1593273842", "978-1784393908"
## $ `Price $` <chr> "38.20", "20.16", "44.93"
glimpse(xml_table)
## Observations: 3
## Variables: 6
## $ Title <chr> "R for Data Science", "Art of R Programming", "Machine...
## $ Author1 <chr> "Hadley Wickham,", "Norman Matloff", "Brett Lantz"
## $ Author2 <chr> "Garrett Grolemund", NA, NA
## $ Subject <chr> "Basic R", "R Programming", "R Machine Learning"
## $ ISBN <chr> "978-1491910399", "978-1593273842", "978-1784393908"
## $ Price <chr> "38.20", "20.16", "44.93"
glimpse(json_table)
## Observations: 6
## Variables: 1
## $ X..i.. <chr> "R for Data Science", "Hadley Wickham,", "Garrett Grole...
We see that the HTML and XML data frames are not identical. Both have 3 observations, but the HTML data frame has 5 variables while the XML data frame has 6, and the column names differ (Book, ISBN 13, and Price $ in HTML versus Title, ISBN, and Price in XML). This reflects a basic difference between the two formats: HTML uses predefined tags to store the data, whereas XML lets users define their own tag names. XML has picked up Author1 and Author2 as separate columns, but HTML has combined both author names in a single cell.
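The comparison can also be made in code. The sketch below uses hand-built stand-ins (html_small and xml_small, names of our choosing) that copy only the title and author values from the glimpse output above. It checks identity with identical(), then shows one possible way to bring the HTML layout in line with the XML one by collapsing whitespace and splitting the combined Author cell at the comma. Treating the comma as the author separator is an assumption that happens to fit these three books.

```r
# Stand-ins for the title and author columns of the two data frames
html_small <- data.frame(
  Book = c("R for Data Science", "Art of R Programming", "Machine Learning with R"),
  Author = c("Hadley Wickham,\r\r\n Garrett Grolemund", "Norman Matloff", "Brett Lantz"),
  stringsAsFactors = FALSE
)
xml_small <- data.frame(
  Title = c("R for Data Science", "Art of R Programming", "Machine Learning with R"),
  Author1 = c("Hadley Wickham,", "Norman Matloff", "Brett Lantz"),
  Author2 = c("Garrett Grolemund", NA, NA),
  stringsAsFactors = FALSE
)

# Different shapes and column names, so the frames are not identical
identical(html_small, xml_small)
dim(html_small)
dim(xml_small)

# Collapse line breaks and repeated spaces, then split at the comma
authors <- gsub("\\s+", " ", html_small$Author)
parts <- strsplit(authors, ",\\s*")

html_split <- data.frame(
  Title = html_small$Book,
  Author1 = vapply(parts, function(p) p[1], character(1)),
  Author2 = vapply(parts, function(p) p[2], character(1)),  # NA when absent
  stringsAsFactors = FALSE
)

# Drop the trailing comma stored in the XML Author1 field before comparing
xml_clean <- xml_small
xml_clean$Author1 <- sub(",$", "", xml_clean$Author1)

identical(html_split, xml_clean)
```

After this harmonisation the two stand-ins hold the same values in the same layout, which suggests the remaining differences between the full data frames come from how each file was structured rather than from the data itself.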
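JSON uses JavaScript object notation (name/value pairs) rather than tags. fromJSON imported the data as a nested list, which we unlisted so it could be transformed into a data frame. The resulting data frame has 6 observations of 1 variable rather than 3 observations of 6 variables. This appears to be because json[[1]] selects only the first book, so each of its six fields became a row.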
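One way to get one row per book is to convert every record, not just the first, into a one-row data frame and then stack the rows. The sketch below is a base R illustration under stated assumptions: records is a hand-built stand-in for json_parsed[[1]], every record carries the same six fields, and json_fixed is a name of our choosing. RJSONIO may return each record as a named character vector rather than a list; unlist handles both cases.

```r
# Hand-built stand-in for json_parsed[[1]]: one named list per book
records <- list(
  list(Title = "R for Data Science", Author1 = "Hadley Wickham,",
       Author2 = "Garrett Grolemund", Subject = "Basic R",
       ISBN = "978-1491910399", Price = "38.20"),
  list(Title = "Art of R Programming", Author1 = "Norman Matloff",
       Author2 = "", Subject = "R Programming",
       ISBN = "978-1593273842", Price = "20.16"),
  list(Title = "Machine Learning with R", Author1 = "Brett Lantz",
       Author2 = "", Subject = "R Machine Learning",
       ISBN = "978-1784393908", Price = "44.93")
)

# unlist() gives a named character vector; t() turns it into a 1 x 6 matrix,
# so data.frame() produces one row with the field names as columns
rows <- lapply(records, function(book) {
  data.frame(t(unlist(book)), stringsAsFactors = FALSE)
})

# Stack the one-row data frames
json_fixed <- do.call(rbind, rows)

str(json_fixed)
```

With this layout the JSON result has the same six columns as the XML data frame, except that a missing second author is an empty string rather than NA.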