library(knitr)
library(jsonlite)
library(XML)
The books chosen are humor and adventure titles for a younger audience. They are first imported from a CSV file to show what the result is meant to look like. In addition to the title and author(s), each book has the number of pages, the year it was published, its ISBN, and the star rating (out of 5) it was given on Goodreads. The three files imported below (JSON, XML and HTML) were created by hand and imported from GitHub URLs.
Books = read.csv("Books.csv")
kable(Books)
| Title | Author.1 | Author.2 | Pages | Published | ISBN | Goodreads.Rating |
|---|---|---|---|---|---|---|
| Good Omens | Neil Gaiman | Terry Pratchett | 430 | 2006 | “0060853980” | 4.25 |
| Thief of Time | Terry Pratchett | | 378 | 2008 | “0061031321” | 4.24 |
| The Graveyard Book | Neil Gaiman | | 312 | 2008 | “0060530928” | 4.11 |
| The Hitchhiker’s Guide to the Galaxy | Douglas Adams | | 193 | 1997 | “0345418913” | 4.20 |
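The periods in the CSV headers above come from the default name checking in read.csv(). As a minimal sketch (the inline CSV text and the name `books_csv` are illustrative assumptions, not the contents of the actual Books.csv), `check.names = FALSE` keeps the spaces in the headers, and `colClasses` reads the ISBN as text so that its leading zero is not dropped:

```r
# Illustrative CSV text; the real Books.csv is not reproduced here.
csv_text <- 'Title,Author 1,Author 2,Pages,Published,ISBN,Goodreads Rating
Good Omens,Neil Gaiman,Terry Pratchett,430,2006,0060853980,4.25
Thief of Time,Terry Pratchett,,378,2008,0061031321,4.24'

books_csv <- read.csv(
  text = csv_text,
  check.names = FALSE,                 # keep "Author 1" instead of "Author.1"
  colClasses = c(ISBN = "character"),  # read ISBN as text, keeping the leading zero
  stringsAsFactors = FALSE
)

names(books_csv)
books_csv$ISBN
```

Whether this is preferable depends on how the columns are used later: names with spaces must be accessed with backticks or `[[ ]]`.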
To use the fromJSON() function, we must first load the “jsonlite” library (all libraries are loaded at the top of the page).
# JSON
jbooks = fromJSON("Books.json")
jbooks
## $`Humor and Adventure Books`
## Title Author 1 Author 2
## 1 Good Omens Neil Gaiman Terry Pratchett
## 2 Thief of Time Terry Pratchett
## 3 The Graveyard Book Neil Gaiman
## 4 The Hitchhiker's Guide to the Galaxy Douglas Adams
## Pages Published ISBN Goodreads Rating
## 1 430 2006 0060853980 4.25
## 2 378 2008 0061031321 4.24
## 3 312 2008 0060530928 4.11
## 4 193 1997 0345418913 4.20
As seen above, the JSON file is already structured much like an R data frame, so making it look like the original CSV import is very simple. If anything, its column headers keep the spaces between words, unlike the CSV import (where read.csv() substitutes periods for spaces).
kable(jbooks)
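The printed result above starts with `$`, which shows that fromJSON() returned a list with a single named element rather than a bare data frame. The following is a minimal sketch of pulling that element out; the inline JSON text is a hypothetical stand-in assumed to mirror the structure of Books.json, and `jlist` and `jdf` are illustrative names:

```r
library(jsonlite)

# Hypothetical JSON text, assumed to mirror the structure of Books.json.
json_text <- '{
  "Humor and Adventure Books": [
    {"Title": "Good Omens", "Author 1": "Neil Gaiman", "Author 2": "Terry Pratchett",
     "Pages": 430, "Published": 2006, "ISBN": "0060853980", "Goodreads Rating": 4.25},
    {"Title": "Thief of Time", "Author 1": "Terry Pratchett", "Author 2": "",
     "Pages": 378, "Published": 2008, "ISBN": "0061031321", "Goodreads Rating": 4.24}
  ]
}'

jlist <- fromJSON(json_text)

# The top-level JSON object becomes a named list; its one element is the data frame.
jdf <- jlist[["Humor and Adventure Books"]]

class(jdf)
sapply(jdf, class)  # column types as inferred by jsonlite
```

Passing the extracted data frame (rather than the enclosing list) to kable() should give a single flat table.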
For XML, we load the “XML” library to use the xmlTreeParse() function. This reads the file directly from the URL and shows what the original XML file looks like. However, this format must be converted into a data frame. The xmlToDataFrame() function does this directly.
# XML
xbooks = xmlTreeParse("Books.xml")
xbooks
## $doc
## $file
## [1] "Books.xml"
##
## $version
## [1] "1.0"
##
## $children
## $children$Humor_and_Adventure_Books
## <Humor_and_Adventure_Books>
## <Book id="1">
## <Title>Good Omens</Title>
## <Author_1>Neil Gaiman</Author_1>
## <Author_2>Terry Pratchett</Author_2>
## <Pages>430</Pages>
## <Published>2006</Published>
## <ISBN>0060853980</ISBN>
## <Goodreads_Rating>4.25</Goodreads_Rating>
## </Book>
## <Book id="2">
## <Title>Thief of Time</Title>
## <Author_1>Terry Pratchett</Author_1>
## <Author_2/>
## <Pages>378</Pages>
## <Published>2008</Published>
## <ISBN>0061031321</ISBN>
## <Goodreads_Rating>4.24</Goodreads_Rating>
## </Book>
## <Book id="3">
## <Title>The Graveyard Book</Title>
## <Author_1>Neil Gaiman</Author_1>
## <Author_2/>
## <Pages>312</Pages>
## <Published>2008</Published>
## <ISBN>0060530928</ISBN>
## <Goodreads_Rating>4.11</Goodreads_Rating>
## </Book>
## <Book id="4">
## <Title>The Hitchhiker's Guide to the Galaxy</Title>
## <Author_1>Douglas Adams</Author_1>
## <Author_2/>
## <Pages>193</Pages>
## <Published>1997</Published>
## <ISBN>0345418913</ISBN>
## <Goodreads_Rating>4.2</Goodreads_Rating>
## </Book>
## </Humor_and_Adventure_Books>
##
##
## attr(,"class")
## [1] "XMLDocumentContent"
##
## $dtd
## $external
## NULL
##
## $internal
## NULL
##
## attr(,"class")
## [1] "DTDList"
##
## attr(,"class")
## [1] "XMLDocument" "XMLAbstractDocument"
xbooks = xmlToDataFrame("Books.xml")
xbooks
## Title Author_1 Author_2
## 1 Good Omens Neil Gaiman Terry Pratchett
## 2 Thief of Time Terry Pratchett
## 3 The Graveyard Book Neil Gaiman
## 4 The Hitchhiker's Guide to the Galaxy Douglas Adams
## Pages Published ISBN Goodreads_Rating
## 1 430 2006 0060853980 4.25
## 2 378 2008 0061031321 4.24
## 3 312 2008 0060530928 4.11
## 4 193 1997 0345418913 4.2
The only difference in the final formatting is that XML element names cannot contain spaces, so underscores replace the spaces between words in the column names.
kable(xbooks)
| Title | Author_1 | Author_2 | Pages | Published | ISBN | Goodreads_Rating |
|---|---|---|---|---|---|---|
| Good Omens | Neil Gaiman | Terry Pratchett | 430 | 2006 | 0060853980 | 4.25 |
| Thief of Time | Terry Pratchett | | 378 | 2008 | 0061031321 | 4.24 |
| The Graveyard Book | Neil Gaiman | | 312 | 2008 | 0060530928 | 4.11 |
| The Hitchhiker’s Guide to the Galaxy | Douglas Adams | | 193 | 1997 | 0345418913 | 4.2 |
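The XML result can be brought closer to the JSON one with a little cleanup. The sketch below is one possible approach, not necessarily how the original analysis would proceed: it parses a hypothetical XML string (assumed to mirror Books.xml), swaps the underscores in the column names for spaces, and converts the numeric columns while leaving the ISBN as text. The names `xml_text`, `xdf` and `num_cols` are illustrative.

```r
library(XML)

# Hypothetical XML text, assumed to mirror the structure of Books.xml.
xml_text <- '<Humor_and_Adventure_Books>
  <Book id="1">
    <Title>Good Omens</Title>
    <Author_1>Neil Gaiman</Author_1>
    <Author_2>Terry Pratchett</Author_2>
    <Pages>430</Pages>
    <Published>2006</Published>
    <ISBN>0060853980</ISBN>
    <Goodreads_Rating>4.25</Goodreads_Rating>
  </Book>
  <Book id="2">
    <Title>Thief of Time</Title>
    <Author_1>Terry Pratchett</Author_1>
    <Author_2/>
    <Pages>378</Pages>
    <Published>2008</Published>
    <ISBN>0061031321</ISBN>
    <Goodreads_Rating>4.24</Goodreads_Rating>
  </Book>
</Humor_and_Adventure_Books>'

doc <- xmlParse(xml_text, asText = TRUE)
xdf <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Replace underscores with spaces to match the JSON and HTML headers.
names(xdf) <- gsub("_", " ", names(xdf), fixed = TRUE)

# Convert the numeric columns; ISBN stays text so its leading zero is kept.
# as.character() first makes this safe even if a column arrives as a factor.
num_cols <- c("Pages", "Published", "Goodreads Rating")
xdf[num_cols] <- lapply(xdf[num_cols], function(x) as.numeric(as.character(x)))

str(xdf)
```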
# HTML
hbooks = readHTMLTable("Books.html", header = TRUE)
hbooks
## $`NULL`
## Title Author 1 Author 2
## 1 Good Omens Neil Gaiman Terry Pratchett
## 2 Thief of Time Terry Pratchett
## 3 The Graveyard Book Neil Gaiman
## 4 The Hitchhiker's Guide to the Galaxy Douglas Adams
## Pages Published ISBN Goodreads Rating
## 1 430 2006 0060853980 4.25
## 2 378 2008 0061031321 4.24
## 3 312 2008 0060530928 4.11
## 4 193 1997 0345418913 4.2
Like the JSON file, the HTML file was simple to read into R. Its column names, too, already contain spaces between words.
kable(hbooks)
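As the `$` in the printed output indicates, readHTMLTable() returns a list of tables, here with one unnamed entry. A minimal sketch of extracting that table by position follows; the inline HTML is a hypothetical, shortened stand-in for Books.html, and `html_text`, `hlist` and `hdf` are illustrative names:

```r
library(XML)

# Hypothetical HTML text, assumed to resemble the table in Books.html
# (only a few of the columns are shown).
html_text <- '<table>
  <tr><th>Title</th><th>Author 1</th><th>Pages</th><th>ISBN</th></tr>
  <tr><td>Good Omens</td><td>Neil Gaiman</td><td>430</td><td>0060853980</td></tr>
  <tr><td>Thief of Time</td><td>Terry Pratchett</td><td>378</td><td>0061031321</td></tr>
</table>'

doc <- htmlParse(html_text, asText = TRUE)

# One list entry per <table> in the document.
hlist <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# The table has no id attribute, so it is selected by position.
hdf <- hlist[[1]]

names(hdf)
```

The extracted data frame can then be passed to kable() in the same way as the CSV and XML results.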
Are the three files identical once imported? Strictly speaking, no. However, the differences are so minor that a bit of manipulation in R can remove them. For example, as mentioned above, the column names differ between some of the formats. In addition, the JSON Goodreads Rating column aligns its numbers to the right, whereas the XML and HTML versions align to the left. This suggests that the JSON ratings were read as numbers and the others as text (note 4.20 versus 4.2 in the printed output). Such minor differences are not significant enough to deem one format better than another, but they do prevent the imported results from being identical.
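The comparison described above can be sketched in base R. The two small data frames below are illustrative stand-ins for the JSON and XML imports (values taken from the first two books, with the rating stored as a number in one and as text in the other, which is an assumption based on the printed output); `from_json`, `from_xml` and `harmonized` are hypothetical names.

```r
# Illustrative stand-ins for two of the imports.
from_json <- data.frame(
  Title = c("Good Omens", "Thief of Time"),
  `Goodreads Rating` = c(4.25, 4.24),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
from_xml <- data.frame(
  Title = c("Good Omens", "Thief of Time"),
  Goodreads_Rating = c("4.25", "4.24"),
  stringsAsFactors = FALSE
)

# Column names and column types both differ, so this is FALSE.
identical(from_json, from_xml)

# Harmonize the names and types, then compare again.
harmonized <- from_xml
names(harmonized) <- gsub("_", " ", names(harmonized), fixed = TRUE)
harmonized[["Goodreads Rating"]] <- as.numeric(harmonized[["Goodreads Rating"]])

all.equal(from_json, harmonized)
```

all.equal() is used for the second comparison because it tolerates tiny floating-point differences and, when the objects differ, describes the mismatch instead of only returning FALSE.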