Working with XML, HTML and JSON in R

Assignment

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

Libraries

Load the libraries needed to run the R code below.

library(bitops)
library(knitr)
library(XML)
library(RCurl)
library(jsonlite)

Load XML

Read the XML file stored on GitHub and use the XML package to load it into R. getURL() downloads the file contents as text, and xmlParse() parses that text into an XML document. xmlToDataFrame() then converts the parsed document to a data frame, which is easier to inspect.

# Download the raw XML text from GitHub
xml_url <- "https://raw.githubusercontent.com/gpadmaperuma/DATA607/master/books.xml"
xml_text <- getURL(xml_url)

# Parse the downloaded text; asText = TRUE marks the input as XML content, not a file name
parsedXML <- xmlParse(xml_text, asText = TRUE)
# Convert the parsed document into a data frame
xml_DF <- xmlToDataFrame(parsedXML)
xml_DF
##   ID                                      Title
## 1 01 Internet and World Wide Web How to Program
## 2 02         A First Course in Database Systems
## 3 03                  Data Science for Business
##                              Author       ISBN-13               Publisher
## 1         P.J. Deitel, H. M. Deitel 9780131752429 Pearson Education, Inc.
## 2 Jeffrey D. Ullman, Jennifer Widom 9789332535206           Pearson India
## 3       Foster Provost, Tom Fawcett 9781449361327          O'Reilly Media
##   Publication_date Pages      Related_Subject
## 1             2008  1373      Web Programming
## 2             2007   504 Database Programming
## 3       12/19/2013   369         Data Science
# Check that R recognizes parsedXML as an XML document
class(parsedXML)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
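
The factor columns that appear in the str(xml_DF) output in the Comparison section come from the stringsAsFactors default, which was TRUE before R 4.0.0. As a minimal sketch, the code below requests character columns explicitly. It uses a made-up two-book XML string rather than the actual books.xml, and the names xml_sample, sample_doc and sample_df are hypothetical. It also assumes the installed XML package accepts the stringsAsFactors argument of xmlToDataFrame().

```r
library(XML)

# Hypothetical inline XML standing in for books.xml (not the real file)
xml_sample <- paste0(
  "<books>",
  "<book><ID>01</ID><Title>First Book</Title><Pages>100</Pages></book>",
  "<book><ID>02</ID><Title>Second Book</Title><Pages>200</Pages></book>",
  "</books>"
)

# asText = TRUE tells xmlParse the input is XML content, not a file name
sample_doc <- xmlParse(xml_sample, asText = TRUE)

# Keep every column as a character vector instead of a factor
sample_df <- xmlToDataFrame(sample_doc, stringsAsFactors = FALSE)

# Convert the page counts to integers once they are plain text
sample_df$Pages <- as.integer(sample_df$Pages)
str(sample_df)
```

Reading text first and converting types afterwards avoids the usual pitfall of calling as.integer() on a factor, which returns level codes rather than the printed values.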

Load JSON

Read the JSON file stored on GitHub and use the jsonlite package to load it into R. I try two functions, parse_json() and fromJSON(), which load the data in different ways: parse_json() returns plain nested lists, while fromJSON() simplifies the book records into a much more structured data frame.

# Download the raw JSON text from GitHub
json_url <- "https://raw.githubusercontent.com/gpadmaperuma/DATA607/master/books.json"
json_text <- getURL(json_url)

# Parse the JSON text with parse_json(), which returns nested lists
parsedJSON <- parse_json(json_text)
# Read the same text with fromJSON(), which simplifies the records into a data frame
json_DF <- fromJSON(json_text)
json_DF
## $`book-table`
## $`book-table`$book
##   ID                                    Title
## 1 01 Internet & World Wide Web How to Program
## 2 02       A First Course in Database Systems
## 3 03                Data Science for Business
##                              Author       ISBN-13               Publisher
## 1         P.J. Deitel, H. M. Deitel 9780131752429 Pearson Education, Inc.
## 2 Jeffrey D. Ullman, Jennifer Widom 9789332535206           Pearson India
## 3       Foster Provost, Tom Fawcett 9781449361327          O'Reilly Media
##   Publication_date Pages      Related_Subject
## 1             2008  1373      Web Programming
## 2             2007   504 Database Programming
## 3       12/19/2013   369         Data Science
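
The printed result is a list with the data frame nested two levels down, under `book-table` and then book. The following sketch shows the difference between the two functions and how to pull the data frame out of the nested result. It uses a small made-up JSON string that mirrors that nesting, not the real books.json, and the names json_sample, as_list, as_simplified and sample_df are hypothetical.

```r
library(jsonlite)

# Hypothetical inline JSON mirroring the nesting of the printed output
json_sample <- '{"book-table": {"book": [
  {"ID": "01", "Title": "First Book"},
  {"ID": "02", "Title": "Second Book"}
]}}'

# parse_json() leaves the records as a list of lists
as_list <- parse_json(json_sample)
class(as_list[["book-table"]][["book"]])

# fromJSON() simplifies the array of records into a data frame
as_simplified <- fromJSON(json_sample)

# Extract the data frame from the two enclosing list levels
sample_df <- as_simplified[["book-table"]][["book"]]
str(sample_df)
```

Extracting the inner element this way gives an object that can be compared directly with the data frames built from the other two formats.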

Load HTML

Loading an HTML table into R is straightforward. Like the other files, the HTML file is stored on GitHub. readHTMLTable() from the XML package reads the first table on the page (which = 1) into a data frame.

# Download the raw HTML text from GitHub
html_file <- getURL("https://raw.githubusercontent.com/gpadmaperuma/DATA607/master/books.html")
# Read the first table in the page into a data frame
html_DF <- readHTMLTable(html_file, which = 1)
html_DF
##   ID                                    Title
## 1 01 Internet & World Wide Web How to Program
## 2 02       A First Course in Database Systems
## 3 03                Data Science for Business
##                              Author       ISBN-13               Publisher
## 1         P.J. Deitel, H. M. Deitel 9780131752429 Pearson Education, Inc.
## 2 Jeffrey D. Ullman, Jennifer Widom 9789332535206           Pearson India
## 3       Foster Provost, Tom Fawcett 9781449361327          O'Reilly Media
##   Publication_date Pages      Related_Subject
## 1             2008  1373      Web Programming
## 2             2007   504 Database Programming
## 3       12/19/2013   369         Data Science
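
As a rough illustration of what readHTMLTable() does with the table markup, the sketch below parses a small made-up HTML table, not the real books.html. The names html_sample, sample_doc and sample_df are hypothetical. The sketch assumes that extra arguments such as stringsAsFactors are passed through to the data frame conversion, as the package documentation describes.

```r
library(XML)

# Hypothetical inline HTML table standing in for books.html
html_sample <- paste0(
  "<html><body><table>",
  "<tr><th>ID</th><th>Title</th></tr>",
  "<tr><td>01</td><td>First Book</td></tr>",
  "<tr><td>02</td><td>Second Book</td></tr>",
  "</table></body></html>"
)

# Parse the text explicitly so it is not mistaken for a file name
sample_doc <- htmlParse(html_sample, asText = TRUE)

# which = 1 selects the first table; keep columns as character vectors
sample_df <- readHTMLTable(sample_doc, which = 1, stringsAsFactors = FALSE)
str(sample_df)
```

The th cells in the first row supply the column names, so only the two td rows end up as observations.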

Comparison

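The XML, HTML and JSON data frames are almost identical, but not quite. The first difference is column type: every column in the XML and HTML data frames is a factor, while the JSON columns are character vectors. The JSON result is also structured differently from the other two: fromJSON() returns a nested list, and the data frame sits inside it at json_DF$`book-table`$book. Finally, the values differ in one place: the first title prints as "Internet and World Wide Web How to Program" in the XML data frame but as "Internet & World Wide Web How to Program" in the JSON and HTML data frames.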

str(html_DF)
## 'data.frame':    3 obs. of  8 variables:
##  $ ID              : Factor w/ 3 levels "01","02","03": 1 2 3
##  $ Title           : Factor w/ 3 levels "A First Course in Database Systems",..: 3 1 2
##  $ Author          : Factor w/ 3 levels "Foster Provost, Tom Fawcett",..: 3 2 1
##  $ ISBN-13         : Factor w/ 3 levels "9780131752429",..: 1 3 2
##  $ Publisher       : Factor w/ 3 levels "O'Reilly Media",..: 2 3 1
##  $ Publication_date: Factor w/ 3 levels "12/19/2013","2007",..: 3 2 1
##  $ Pages           : Factor w/ 3 levels "1373","369","504": 1 3 2
##  $ Related_Subject : Factor w/ 3 levels "Data Science",..: 3 2 1
str(json_DF)
## List of 1
##  $ book-table:List of 1
##   ..$ book:'data.frame': 3 obs. of  8 variables:
##   .. ..$ ID              : chr [1:3] "01" "02" "03"
##   .. ..$ Title           : chr [1:3] "Internet & World Wide Web How to Program" "A First Course in Database Systems" "Data Science for Business"
##   .. ..$ Author          : chr [1:3] "P.J. Deitel, H. M. Deitel" "Jeffrey D. Ullman, Jennifer Widom" "Foster Provost, Tom Fawcett"
##   .. ..$ ISBN-13         : chr [1:3] "9780131752429" "9789332535206" "9781449361327"
##   .. ..$ Publisher       : chr [1:3] "Pearson Education, Inc." "Pearson India" "O'Reilly Media"
##   .. ..$ Publication_date: chr [1:3] "2008" "2007" "12/19/2013"
##   .. ..$ Pages           : chr [1:3] "1373" "504" "369"
##   .. ..$ Related_Subject : chr [1:3] "Web Programming" "Database Programming" "Data Science"
str(xml_DF)
## 'data.frame':    3 obs. of  8 variables:
##  $ ID              : Factor w/ 3 levels "01","02","03": 1 2 3
##  $ Title           : Factor w/ 3 levels "A First Course in Database Systems",..: 3 1 2
##  $ Author          : Factor w/ 3 levels "Foster Provost, Tom Fawcett",..: 3 2 1
##  $ ISBN-13         : Factor w/ 3 levels "9780131752429",..: 1 3 2
##  $ Publisher       : Factor w/ 3 levels "O'Reilly Media",..: 2 3 1
##  $ Publication_date: Factor w/ 3 levels "12/19/2013","2007",..: 3 2 1
##  $ Pages           : Factor w/ 3 levels "1373","369","504": 1 3 2
##  $ Related_Subject : Factor w/ 3 levels "Data Science",..: 3 2 1
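
The question of whether the data frames are identical can also be checked in code rather than by eye. The sketch below uses two small made-up data frames, one with factor columns and one with character columns, to stand in for the XML/HTML and JSON results. The names factor_df, char_df and to_character are hypothetical helpers and are not part of the analysis above.

```r
# Toy stand-ins: same values, different column types
factor_df <- data.frame(ID = c("01", "02"),
                        Pages = c("1373", "504"),
                        stringsAsFactors = TRUE)
char_df <- data.frame(ID = c("01", "02"),
                      Pages = c("1373", "504"),
                      stringsAsFactors = FALSE)

# identical() is strict, so the factor/character mismatch makes it fail
same_before <- identical(factor_df, char_df)

# Convert every column to character while keeping the data frame shape
to_character <- function(df) {
  df[] <- lapply(df, as.character)
  df
}

# After normalizing the column types, the two data frames match
same_after <- identical(to_character(factor_df), to_character(char_df))

print(c(before = same_before, after = same_after))
```

Applied to the real objects, the same idea would compare to_character(xml_DF), to_character(html_DF) and json_DF$`book-table`$book. Any remaining mismatch would then point to a difference in the stored values rather than in the column types.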