Assignment 7 - Working with XML and JSON in R

library(knitr)
library(jsonlite)
library(XML)

Assignment Goals:

Pick Books

Pick three books on one subject
At least one with more than one author
Include Title, Author and 3 other characteristics

Create three separate files

One in JSON
One in XML
One in HTML (using an HTML table)
Create them by hand for practice

Load the three files into R

Read them in separately
Compare if they are identical

The books chosen are humor and adventure books for a younger audience. They are first imported from a CSV file, just to show what the result is meant to look like. Besides the title and author(s), each book has four attributes: the number of pages, the year it was published, the ISBN, and its star rating out of 5 on Goodreads. The three files imported below (JSON, XML and HTML) were created by hand and imported from GitHub URLs.

Books = read.csv("Books.csv")
kable(Books)

Title	Author.1	Author.2	Pages	Published	ISBN	Goodreads.Rating
Good Omens	Neil Gaiman	Terry Pratchett	430	2006	“0060853980”	4.25
Thief of Time	Terry Pratchett		378	2008	“0061031321”	4.24
The Graveyard Book	Neil Gaiman		312	2008	“0060530928”	4.11
The Hitchhiker’s Guide to the Galaxy	Douglas Adams		193	1997	“0345418913”	4.20
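
The ISBNs in this data set begin with a zero, and read.csv() would drop that zero if it parsed the column as a number. As a minimal sketch of one way to keep it, the block below reads a small inline CSV (a made-up two-row excerpt, not the actual Books.csv) and forces the ISBN column to character with colClasses. The names csv_text, as_number and as_text are illustrative only.

```r
# Hypothetical two-row excerpt; the real Books.csv is not reproduced here.
csv_text <- "Title,Pages,ISBN
Good Omens,430,0060853980
Thief of Time,378,0061031321"

# Default parsing: ISBN is guessed to be a number, so leading zeros are lost.
as_number <- read.csv(text = csv_text)

# Naming the column in colClasses keeps ISBN as text; other columns are still guessed.
as_text <- read.csv(text = csv_text, colClasses = c(ISBN = "character"))

as_number$ISBN
as_text$ISBN
```

Only the column named in colClasses is pinned, so Pages is still read as an integer.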

JSON

In order to use the fromJSON() function, we first must load the “jsonlite” library (all libraries were loaded at the top of the page).

# JSON
jbooks = fromJSON("Books.json")
jbooks

## $`Humor and Adventure Books`
##                                  Title        Author 1        Author 2
## 1                           Good Omens     Neil Gaiman Terry Pratchett
## 2                        Thief of Time Terry Pratchett                
## 3                   The Graveyard Book     Neil Gaiman                
## 4 The Hitchhiker's Guide to the Galaxy   Douglas Adams                
##   Pages Published       ISBN Goodreads Rating
## 1   430      2006 0060853980             4.25
## 2   378      2008 0061031321             4.24
## 3   312      2008 0060530928             4.11
## 4   193      1997 0345418913             4.20

As seen above, fromJSON() returns the JSON file in a form very close to an R data frame, so editing it to look like the original CSV file is simple. If anything, the column headers keep the spaces between words, unlike the CSV import (which substitutes periods for spaces).

kable(jbooks)

Title	Author 1	Author 2	Pages	Published	ISBN	Goodreads Rating
Good Omens	Neil Gaiman	Terry Pratchett	430	2006	0060853980	4.25
Thief of Time	Terry Pratchett		378	2008	0061031321	4.24
The Graveyard Book	Neil Gaiman		312	2008	0060530928	4.11
The Hitchhiker’s Guide to the Galaxy	Douglas Adams		193	1997	0345418913	4.20
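
The `$` line at the top of the fromJSON() output shows that the result is a named list holding one data frame, not a bare data frame. The sketch below illustrates this with an inline JSON string whose layout is assumed to resemble Books.json (a single top-level key holding an array of objects). The actual file is not shown in this report, so the string and the names json_text, parsed and books_df are illustrative.

```r
library(jsonlite)

# Assumed layout: one top-level key whose value is an array of book objects.
json_text <- '{
  "Humor and Adventure Books": [
    {"Title": "Good Omens", "Author 1": "Neil Gaiman",
     "Author 2": "Terry Pratchett", "ISBN": "0060853980", "Goodreads Rating": 4.25},
    {"Title": "Thief of Time", "Author 1": "Terry Pratchett",
     "Author 2": "", "ISBN": "0061031321", "Goodreads Rating": 4.24}
  ]
}'

parsed <- fromJSON(json_text)

# The array of objects is simplified to a data frame stored under the key.
books_df <- parsed[["Humor and Adventure Books"]]

class(parsed)
class(books_df)
names(books_df)
```

Pulling the data frame out of the list first makes later steps, such as renaming columns or comparing with the other formats, easier.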

XML

For XML, we load the “XML” library to use the xmlTreeParse() function, which reads the file directly from the URL and shows what the original XML file looks like. However, this format must be converted into a data frame. The xmlToDataFrame() function does this directly.

# XML
xbooks = xmlTreeParse("Books.xml")
xbooks

## $doc
## $file
## [1] "Books.xml"
## 
## $version
## [1] "1.0"
## 
## $children
## $children$Humor_and_Adventure_Books
## <Humor_and_Adventure_Books>
##  <Book id="1">
##   <Title>Good Omens</Title>
##   <Author_1>Neil Gaiman</Author_1>
##   <Author_2>Terry Pratchett</Author_2>
##   <Pages>430</Pages>
##   <Published>2006</Published>
##   <ISBN>0060853980</ISBN>
##   <Goodreads_Rating>4.25</Goodreads_Rating>
##  </Book>
##  <Book id="2">
##   <Title>Thief of Time</Title>
##   <Author_1>Terry Pratchett</Author_1>
##   <Author_2/>
##   <Pages>378</Pages>
##   <Published>2008</Published>
##   <ISBN>0061031321</ISBN>
##   <Goodreads_Rating>4.24</Goodreads_Rating>
##  </Book>
##  <Book id="3">
##   <Title>The Graveyard Book</Title>
##   <Author_1>Neil Gaiman</Author_1>
##   <Author_2/>
##   <Pages>312</Pages>
##   <Published>2008</Published>
##   <ISBN>0060530928</ISBN>
##   <Goodreads_Rating>4.11</Goodreads_Rating>
##  </Book>
##  <Book id="4">
##   <Title>The Hitchhiker&apos;s Guide to the Galaxy</Title>
##   <Author_1>Douglas Adams</Author_1>
##   <Author_2/>
##   <Pages>193</Pages>
##   <Published>1997</Published>
##   <ISBN>0345418913</ISBN>
##   <Goodreads_Rating>4.2</Goodreads_Rating>
##  </Book>
## </Humor_and_Adventure_Books>
## 
## 
## attr(,"class")
## [1] "XMLDocumentContent"
## 
## $dtd
## $external
## NULL
## 
## $internal
## NULL
## 
## attr(,"class")
## [1] "DTDList"
## 
## attr(,"class")
## [1] "XMLDocument"         "XMLAbstractDocument"
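
The tree printed above is an ordinary R object that can also be walked node by node. As a rough sketch, the block below parses a shortened inline copy of the XML (two books, fewer fields) and reads the root name, the number of books, an attribute and an element value. The variable names are illustrative and are not part of the original analysis.

```r
library(XML)

# Shortened stand-in for Books.xml, passed as text instead of a file path.
xml_text <- '<Humor_and_Adventure_Books>
  <Book id="1"><Title>Good Omens</Title><Pages>430</Pages></Book>
  <Book id="2"><Title>Thief of Time</Title><Pages>378</Pages></Book>
</Humor_and_Adventure_Books>'

tree <- xmlTreeParse(xml_text, asText = TRUE)
root <- xmlRoot(tree)

root_name   <- xmlName(root)                    # name of the top-level element
n_books     <- xmlSize(root)                    # number of <Book> children
first_book  <- root[[1]]
first_id    <- xmlGetAttr(first_book, "id")     # attributes are returned as text
first_title <- xmlValue(first_book[["Title"]])  # text inside <Title>
```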

xbooks = xmlToDataFrame("Books.xml")
xbooks

##                                  Title        Author_1        Author_2
## 1                           Good Omens     Neil Gaiman Terry Pratchett
## 2                        Thief of Time Terry Pratchett                
## 3                   The Graveyard Book     Neil Gaiman                
## 4 The Hitchhiker's Guide to the Galaxy   Douglas Adams                
##   Pages Published       ISBN Goodreads_Rating
## 1   430      2006 0060853980             4.25
## 2   378      2008 0061031321             4.24
## 3   312      2008 0060530928             4.11
## 4   193      1997 0345418913              4.2

The only difference in the final formatting is in the column titles: XML element names cannot contain spaces, so an underscore is substituted for the space between words in column names.

kable(xbooks)

Title	Author_1	Author_2	Pages	Published	ISBN	Goodreads_Rating
Good Omens	Neil Gaiman	Terry Pratchett	430	2006	0060853980	4.25
Thief of Time	Terry Pratchett		378	2008	0061031321	4.24
The Graveyard Book	Neil Gaiman		312	2008	0060530928	4.11
The Hitchhiker’s Guide to the Galaxy	Douglas Adams		193	1997	0345418913	4.2
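
The rating 4.2 (not 4.20) in the XML table hints that xmlToDataFrame() keeps element values as text. The sketch below, using a shortened inline stand-in for Books.xml, checks the column classes and converts the numeric columns by hand. Passing stringsAsFactors = FALSE and the choice of columns to convert are assumptions made for this illustration.

```r
library(XML)

# Shortened stand-in for Books.xml.
xml_text <- '<Books>
  <Book id="1"><Title>Good Omens</Title><Pages>430</Pages><Goodreads_Rating>4.25</Goodreads_Rating></Book>
  <Book id="2"><Title>The Hitchhiker&apos;s Guide to the Galaxy</Title><Pages>193</Pages><Goodreads_Rating>4.2</Goodreads_Rating></Book>
</Books>'

doc <- xmlParse(xml_text, asText = TRUE)
books_df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Record the classes before any conversion: element values arrive as text.
classes_before <- sapply(books_df, class)

# Convert the columns that are meant to be numbers.
books_df$Pages <- as.integer(books_df$Pages)
books_df$Goodreads_Rating <- as.numeric(books_df$Goodreads_Rating)

classes_before
sapply(books_df, class)
```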

HTML

# HTML
hbooks = readHTMLTable("Books.html", header = TRUE)
hbooks

## $`NULL`
##                                  Title        Author 1        Author 2
## 1                           Good Omens     Neil Gaiman Terry Pratchett
## 2                        Thief of Time Terry Pratchett                
## 3                   The Graveyard Book     Neil Gaiman                
## 4 The Hitchhiker's Guide to the Galaxy   Douglas Adams                
##   Pages Published       ISBN Goodreads Rating
## 1   430      2006 0060853980             4.25
## 2   378      2008 0061031321             4.24
## 3   312      2008 0060530928             4.11
## 4   193      1997 0345418913              4.2

Like the JSON file, the HTML file was simple to read into R, here with readHTMLTable() from the “XML” library. It, too, kept the spaces in the column names.

kable(hbooks)

Title	Author 1	Author 2	Pages	Published	ISBN	Goodreads Rating
Good Omens	Neil Gaiman	Terry Pratchett	430	2006	0060853980	4.25
Thief of Time	Terry Pratchett		378	2008	0061031321	4.24
The Graveyard Book	Neil Gaiman		312	2008	0060530928	4.11
The Hitchhiker’s Guide to the Galaxy	Douglas Adams		193	1997	0345418913	4.2
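
As the `$` line in the output above shows, readHTMLTable() returns a list with one entry per table on the page, so the data frame has to be extracted from it. The block below is a small sketch using an inline HTML table assumed to resemble Books.html, with a header row of th cells. The markup and the names html_text, tables and books_df are illustrative.

```r
library(XML)

# Assumed layout: a single table whose first row holds the column headers.
html_text <- '<html><body><table>
  <tr><th>Title</th><th>Author 1</th><th>Pages</th></tr>
  <tr><td>Good Omens</td><td>Neil Gaiman</td><td>430</td></tr>
  <tr><td>Thief of Time</td><td>Terry Pratchett</td><td>378</td></tr>
</table></body></html>'

doc <- htmlParse(html_text, asText = TRUE)

# One list element per <table>; cell values are kept as text here.
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)
books_df <- tables[[1]]

names(books_df)
books_df
```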

Final Thoughts

Are the three files identical? Strictly, no. However, the differences are so small that a little manipulation in R would remove them. For example, as mentioned above, the column names differ between some of the formats. In addition, the Goodreads Rating column is right-aligned in the JSON table but left-aligned in the XML and HTML tables. This suggests that the JSON ratings were read as numbers while the XML and HTML ratings were read as text, which would also explain why 4.20 prints as 4.2 in those two tables. Such minor differences are not significant enough to make one format better than another, but they do keep the results from being identical.
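
The comparison can also be made programmatically. The sketch below builds two small stand-in data frames (not the real jbooks and xbooks) that differ in the ways described above, then shows that identical() fails before the column names and types are normalised. The helper normalise() is hypothetical and written only for this illustration.

```r
# Stand-ins for the JSON and XML results: same data, different names and types.
from_json <- data.frame(
  Title = c("Good Omens", "Thief of Time"),
  `Goodreads Rating` = c(4.25, 4.24),
  check.names = FALSE, stringsAsFactors = FALSE
)
from_xml <- data.frame(
  Title = c("Good Omens", "Thief of Time"),
  Goodreads_Rating = c("4.25", "4.24"),
  stringsAsFactors = FALSE
)

before <- identical(from_json, from_xml)  # FALSE: names and column types differ

# Hypothetical helper: unify separators in names and re-guess each column's type.
normalise <- function(df) {
  names(df) <- gsub("[ _]", ".", names(df))
  df[] <- lapply(df, function(col) type.convert(as.character(col), as.is = TRUE))
  df
}

after <- identical(normalise(from_json), normalise(from_xml))
```

With the names and types aligned, the two stand-ins compare as identical, which matches the conclusion that the formats differ only in presentation.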

Georgia Galanopoulos

March 19, 2017
