Working with HTML, XML, and JSON Files

This assignment reads three files from the web (GitHub) in HTML, XML, and JSON format. Each file describes the same three books with attributes such as title, author, ISBN, and number of pages. Let's see how R reads in these three files.

library(jsonlite)
library(XML)
library(xml2)
library(RCurl)

## Loading required package: bitops

library(rvest)

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:XML':
## 
##     xml

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(plyr)

## -------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## -------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

Load the JSON, HTML, and XML data

# parse the JSON file straight from its URL
jsonURL <- "https://raw.githubusercontent.com/jonathan1987/CUNYSPS_IS607/master/Week7/books.json"
books_json <- fromJSON(jsonURL)

# read the HTML page, then extract its tables as a list of data frames
htmlURL <- "https://raw.githubusercontent.com/jonathan1987/CUNYSPS_IS607/master/Week7/books.html"
books_HTML <- read_html(htmlURL)
books_HTML <- html_table(books_HTML)

xmlURL <- "https://raw.githubusercontent.com/jonathan1987/CUNYSPS_IS607/master/Week7/Books.xml"
books_XML <- xmlRoot(xmlParse(getURL(xmlURL))) # get XML file contents

# turn each child node into a one-row data frame and stack the rows with ldply
books_XML_df <- ldply(xmlToList(books_XML), data.frame)

# drop the .id column that ldply adds
books_XML_df <- books_XML_df %>% select(-.id)
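
To see what the XML step does without fetching anything, here is a minimal sketch on a small inline document. The element names (books, book, title, name, pages) are assumptions chosen to resemble the printed output; the real Books.xml may be laid out differently. It shows how repeated name elements become name, name.1, name.2 and how a book with fewer authors gets NA.

```r
library(XML)   # xmlParse, xmlRoot, xmlToList
library(plyr)  # ldply

# Hypothetical two-book document (not the real Books.xml).
xml_text <- paste0(
  "<books>",
  "<book><title>Learning SQL</title>",
  "<name>Alan Beaulieu</name><pages>338</pages></book>",
  "<book><title>Learning from Data</title>",
  "<name>Yaser S. Abu-Mostafa</name><name>Malik Magdon-Ismail</name>",
  "<name>Hsuan-Tien Lin</name><pages>213</pages></book>",
  "</books>"
)

# Parse the string and take the root node, as in the code above.
root <- xmlRoot(xmlParse(xml_text, asText = TRUE))

# xmlToList gives one list per <book>; data.frame() renames the repeated
# "name" entries to name, name.1, name.2, and ldply stacks the rows,
# filling columns a book does not have with NA.
books_df <- ldply(xmlToList(root), data.frame)

# Drop the .id column with base R, so dplyr is not needed here.
books_df$.id <- NULL

print(books_df)
```

Because ldply appends columns in the order it first meets them, the extra author columns end up after pages, which is the same ordering seen for name.2 in the printed XML data frame.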

Print out each dataset

books_json

## $Book
##                                                  title
## 1 Doing Data Science: Straight Talk from the Frontline
## 2                                         Learning SQL
## 3                                   Learning from Data
##                                                      Author
## 1                               Cathy O'Neil, Rachel Schutt
## 2                                             Alan Beaulieu
## 3 Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin
##                     Publisher    ISBN-10        ISBN-13 pages
## 1   O'Reilly Media; 1 edition 1449358659 978-1449358655   408
## 2 O'Reilly Media; 2nd edition 0596520832 978-0596520830   338
## 3                     AMLBook 1600490069 978-1600490064   213

books_HTML

## [[1]]
##                                                  title
## 1 Doing Data Science: Straight Talk from the Frontline
## 2                                         Learning SQL
## 3                                   Learning from Data
##            Authorname1         Authorname2    Authorname3
## 1         Cathy O'Neil       Rachel Schutt               
## 2        Alan Beaulieu                                   
## 3 Yaser S. Abu-Mostafa Malik Magdon-Ismail Hsuan-Tien Lin
##                     Publisher    ISBN-10        ISBN-13 pages
## 1   O'Reilly Media; 1 edition 1449358659 978-1449358655   408
## 2 O'Reilly Media; 2nd edition  596520832 978-0596520830   338
## 3                     AMLBook 1600490069 978-1600490064   213
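
One detail in the output above: the ISBN-10 for Learning SQL prints as 596520832 in the HTML table, while the JSON output shows 0596520832. A likely cause is that html_table converts columns that look numeric, which drops the leading zero. The sketch below reproduces this on a small inline table (a hypothetical stand-in for books.html) and assumes a recent rvest in which html_table accepts a convert argument.

```r
library(rvest)  # read_html, html_table

# Hypothetical one-row table (not the real books.html).
html_text <- paste0(
  "<table>",
  "<tr><th>title</th><th>ISBN-10</th></tr>",
  "<tr><td>Learning SQL</td><td>0596520832</td></tr>",
  "</table>"
)
page <- read_html(html_text)

# Default behaviour: columns that look numeric are converted.
converted <- html_table(page)[[1]]

# Assumption: the installed rvest supports convert = FALSE,
# which keeps every cell as text.
as_text <- html_table(page, convert = FALSE)[[1]]

print(converted[["ISBN-10"]])
print(as_text[["ISBN-10"]])
```

Keeping identifiers such as ISBNs as text avoids this kind of silent change when the same data is compared across formats.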

books_XML_df

##                                                  title
## 1 Doing Data Science: Straight Talk from the Frontline
## 2                                         Learning SQL
## 3                                   Learning from Data
##                   name              name.1                   Publisher
## 1         Cathy O'Neil       Rachel Schutt   O'Reilly Media; 1 edition
## 2        Alan Beaulieu                <NA> O'Reilly Media; 2nd edition
## 3 Yaser S. Abu-Mostafa Malik Magdon-Ismail                     AMLBook
##      ISBN.10        ISBN.13 pages         name.2
## 1 1449358659 978-1449358655   408           <NA>
## 2 0596520832 978-0596520830   338           <NA>
## 3 1600490069 978-1600490064   213 Hsuan-Tien Lin


Jonathan Hernandez

October 14, 2016

The three datasets are similar except in how they represent a field with multiple values (here, a book's authors).

JSON puts all of a book's authors in a single Author column.

HTML spreads a book's authors across multiple columns (Authorname1 to Authorname3), leaving unused ones blank.

XML, like HTML, spreads a book's authors across multiple columns (name, name.1, name.2), filling unused ones with NA.
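
If the three datasets need to line up, one option is to split the combined JSON author field into one column per author. The sketch below is only an illustration in base R. It rebuilds two columns of the JSON result by hand and assumes that Author holds a single comma-separated string per book; if the real column is a list of vectors, the strsplit step would not be needed. The Authorname1 to Authorname3 names are borrowed from the HTML table.

```r
# Hand-built copy of two columns from the JSON output (values from the printout).
books <- data.frame(
  title = c("Doing Data Science: Straight Talk from the Frontline",
            "Learning SQL",
            "Learning from Data"),
  Author = c("Cathy O'Neil, Rachel Schutt",
             "Alan Beaulieu",
             "Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin"),
  stringsAsFactors = FALSE
)

# Split each author string on ", " into a character vector.
author_list <- strsplit(books$Author, ", ", fixed = TRUE)
max_authors <- max(lengths(author_list))

# Pad shorter vectors with NA so every book has the same number of slots.
padded <- lapply(author_list, function(a) {
  length(a) <- max_authors
  a
})

# Bind the padded vectors into columns named like the HTML table's.
author_cols <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(author_cols) <- paste0("Authorname", seq_len(max_authors))

books_wide <- cbind(books["title"], author_cols)
print(books_wide)
```

Going the other way (pasting the author columns of the HTML or XML result into one string) would work just as well; which layout is preferable depends on how the authors are used later.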