This assignment reads three files from the web (GitHub) in HTML, XML, and
JSON format. Each file describes the same three books, with attributes such
as title, author, ISBN, and number of pages. Let's see how R reads in each
of these three files.
library(jsonlite)
library(XML)
library(xml2)
library(RCurl)
## Loading required package: bitops
library(rvest)
##
## Attaching package: 'rvest'
## The following object is masked from 'package:XML':
##
## xml
# Load plyr before dplyr so that plyr does not mask dplyr functions
# such as mutate, rename, and summarise
library(plyr)
library(dplyr)
Load the JSON, HTML, and XML data
# JSON: fromJSON() downloads and parses the file in one step
jsonURL <- "https://raw.githubusercontent.com/jonathan1987/CUNYSPS_IS607/master/Week7/books.json"
books_json <- fromJSON(jsonURL)
# HTML: parse the page, then extract its tables
htmlURL <- "https://raw.githubusercontent.com/jonathan1987/CUNYSPS_IS607/master/Week7/books.html"
books_HTML <- read_html(htmlURL)
books_HTML <- html_table(books_HTML) # returns a list of tables
# XML: download the file as text, parse it, and take the root node
xmlURL <- "https://raw.githubusercontent.com/jonathan1987/CUNYSPS_IS607/master/Week7/Books.xml"
books_XML <- xmlRoot(xmlParse(getURL(xmlURL)))
# convert each book node to a data frame and stack them with ldply
books_XML_df <- ldply(xmlToList(books_XML), data.frame)
# remove the .id column that ldply adds
books_XML_df <- books_XML_df %>% select(-.id)
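The JSON step can also be tried offline against an inline string. This is only a sketch: the layout below (a top-level `Book` array of records) is an assumption inferred from the printed `$Book` output, not a copy of the actual books.json. The variable names `json_text`, `parsed`, and `books_df` are made up for illustration, and only two of the three books are included to keep it short.

```r
library(jsonlite)

# Assumed layout: a top-level "Book" array holding one record per book
json_text <- '{
  "Book": [
    {"title": "Learning SQL",
     "Author": "Alan Beaulieu",
     "pages": 338},
    {"title": "Learning from Data",
     "Author": "Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin",
     "pages": 213}
  ]
}'

# fromJSON() accepts a JSON string as well as a URL
parsed <- fromJSON(json_text)

# The array of records is simplified into a data frame stored under $Book
books_df <- parsed$Book
str(books_df)
```

Because all authors of a book sit in one JSON string, they arrive in R as a single `Author` column.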
Print out each dataset
books_json
## $Book
##                                                  title
## 1 Doing Data Science: Straight Talk from the Frontline
## 2                                         Learning SQL
## 3                                   Learning from Data
##                                                      Author
## 1                               Cathy O'Neil, Rachel Schutt
## 2                                             Alan Beaulieu
## 3 Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin
##                     Publisher    ISBN-10        ISBN-13 pages
## 1   O'Reilly Media; 1 edition 1449358659 978-1449358655   408
## 2 O'Reilly Media; 2nd edition 0596520832 978-0596520830   338
## 3                     AMLBook 1600490069 978-1600490064   213
books_HTML
## [[1]]
##                                                  title
## 1 Doing Data Science: Straight Talk from the Frontline
## 2                                         Learning SQL
## 3                                   Learning from Data
##            Authorname1         Authorname2    Authorname3
## 1         Cathy O'Neil       Rachel Schutt
## 2        Alan Beaulieu
## 3 Yaser S. Abu-Mostafa Malik Magdon-Ismail Hsuan-Tien Lin
##                     Publisher    ISBN-10        ISBN-13 pages
## 1   O'Reilly Media; 1 edition 1449358659 978-1449358655   408
## 2 O'Reilly Media; 2nd edition  596520832 978-0596520830   338
## 3                     AMLBook 1600490069 978-1600490064   213
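In the HTML output, the ISBN-10 of Learning SQL prints as 596520832 rather than 0596520832. This is consistent with html_table() converting columns that look numeric, which drops leading zeros. The sketch below reproduces the effect with base R's type.convert() on the three ISBN-10 values; it illustrates the mechanism under that assumption and is not a trace of what rvest does internally. The variable names are made up for illustration.

```r
# ISBN-10 values as they appear as text in the source table
isbn_text <- c("1449358659", "0596520832", "1600490069")

# Automatic type conversion treats all-digit strings as numbers,
# so the leading zero of the second value is lost
isbn_converted <- type.convert(isbn_text, as.is = TRUE)
print(isbn_converted)

# One way to restore a fixed width of 10 digits after the fact.
# Assumption: every value is purely numeric (no ISBN ending in "X")
isbn_restored <- sprintf("%010d", isbn_converted)
print(isbn_restored)
```

Keeping identifier columns such as ISBNs as character from the start avoids the problem. Recent rvest releases document a `convert` argument on html_table() for this purpose, though whether it is available depends on the installed version.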
books_XML_df
##                                                  title
## 1 Doing Data Science: Straight Talk from the Frontline
## 2                                         Learning SQL
## 3                                   Learning from Data
##                   name              name.1                   Publisher
## 1         Cathy O'Neil       Rachel Schutt   O'Reilly Media; 1 edition
## 2        Alan Beaulieu                <NA> O'Reilly Media; 2nd edition
## 3 Yaser S. Abu-Mostafa Malik Magdon-Ismail                     AMLBook
##      ISBN.10        ISBN.13 pages         name.2
## 1 1449358659 978-1449358655   408           <NA>
## 2 0596520832 978-0596520830   338           <NA>
## 3 1600490069 978-1600490064   213 Hsuan-Tien Lin
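We can see that the three datasets are largely the same. The main difference
is how a field with multiple values (authors, in this case) is represented:
JSON puts all of a book's authors in a single column.
HTML spreads a book's authors across multiple columns (Authorname1,
Authorname2, Authorname3) and leaves the unused cells empty.
XML, like HTML, spreads a book's authors across multiple columns (name,
name.1, name.2), but fills the unused cells with <NA>.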
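To compare the three results side by side, the wide author columns can be collapsed into one comma-separated column like the JSON version. The following is a minimal sketch in base R, not part of the original assignment. `collapse_authors()` is a hypothetical helper, and `books_wide` is typed in by hand from the HTML output rather than read from the web.

```r
# Author columns copied from the HTML output; "" marks a missing author
books_wide <- data.frame(
  title = c("Doing Data Science: Straight Talk from the Frontline",
            "Learning SQL",
            "Learning from Data"),
  Authorname1 = c("Cathy O'Neil", "Alan Beaulieu", "Yaser S. Abu-Mostafa"),
  Authorname2 = c("Rachel Schutt", "", "Malik Magdon-Ismail"),
  Authorname3 = c("", "", "Hsuan-Tien Lin"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: join the non-missing authors of each row.
# Both NA (as in the XML result) and "" (as in the HTML result) are skipped.
collapse_authors <- function(df, author_cols) {
  apply(df[, author_cols, drop = FALSE], 1, function(row) {
    present <- row[!is.na(row) & nzchar(row)]
    paste(present, collapse = ", ")
  })
}

author_cols <- grep("^Authorname", names(books_wide), value = TRUE)
books_wide$Author <- collapse_authors(books_wide, author_cols)

# Keep only the columns that match the JSON layout
books_single <- books_wide[, c("title", "Author")]
print(books_single)
```

The same helper should work on the XML data frame by passing `c("name", "name.1", "name.2")` as `author_cols`, since the `<NA>` entries are skipped in the same way as empty strings.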