Assignment – Working with XML and JSON in R

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Load the required packages

require(XML)
## Loading required package: XML
require(rvest)
## Loading required package: rvest
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:XML':
## 
##     xml
require(kableExtra)
## Loading required package: kableExtra
require(RCurl)
## Loading required package: RCurl
## Loading required package: bitops
require(methods)
require(xml2)
require(jsonlite)
## Loading required package: jsonlite

HTML

The first file to load is the HTML table. We first uploaded the HTML file to GitHub, and we read its raw contents from there.

htmlgithub<-getURL("https://raw.githubusercontent.com/Luz917/books.html/master/book.html")
cat(htmlgithub)
## <table>
## <tr>
##   <th>Title</th>
##   <th>Author 1</th>
##   <th>Author 2</th>
##   <th>Pages</th>
##   <th>Publisher</th>
##   <th>ISBN</th>
## </tr>
## <body>
##  <tr>
##       <td>Harry Potter and the Soccerer's Stone</td>
##       <td> J.K Rowling</td>
##       <td>Mary GrandPre(Illustrator)
##       <td>322</td>
##       <td>Scholastic Inc</td>
##       <td>0439554934</td>
##  </tr>
##  <tr>
##       <td>Beautiful Creatures</td>
##       <td>Kami Garcia</td>
##       <td>Margaret Stohl</td>
##       <td>563</td>
##       <td>Little, Brown and Company</td>
##       <td>0316042676</td>
##  </tr>
##  <tr>
##       <td>The Hunger Games</td>
##       <td>Suzanne Collins</td>
##       <td>None</td>
##       <td>374</td>
##       <td>Scholastic Press</td>
##       <td>0439023483</td>
##   </tr>
## </body>
## </table>

Here we read in the HTML and convert it to a data frame. readHTMLTable returns a list of tables, so after calling data.frame each column name carries a NULL. prefix; we reset the column names to remove it. Finally, we use kableExtra to style the table.

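# readHTMLTable() returns a list of tables; this file contains only one
books_html <- readHTMLTable(htmlgithub)
# Flattening the list prefixes every column name with "NULL."
books_html <- data.frame(books_html)
# Reset the column names to drop that prefix
colnames(books_html) <- c("Title", "Author1", "Author2", "Pages", "Publisher", "ISBN")
kable(books_html) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "right")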
Title | Author1 | Author2 | Pages | Publisher | ISBN
Harry Potter and the Soccerer’s Stone | J.K Rowling | Mary GrandPre(Illustrator) | 322 | Scholastic Inc | 0439554934
Beautiful Creatures | Kami Garcia | Margaret Stohl | 563 | Little, Brown and Company | 0316042676
The Hunger Games | Suzanne Collins | None | 374 | Scholastic Press | 0439023483
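
As a hedged aside, the NULL. prefix can be avoided by taking the table out of the list before any conversion. The sketch below assumes a small inline HTML string (html_snippet, a made-up stand-in for the GitHub file) and parses it with htmlParse and asText = TRUE so that it runs without a network connection.

```r
library(XML)

# Hypothetical stand-in for the contents of book.html (not the real file)
html_snippet <- paste0(
  "<table>",
  "<tr><th>Title</th><th>Pages</th></tr>",
  "<tr><td>Beautiful Creatures</td><td>563</td></tr>",
  "<tr><td>The Hunger Games</td><td>374</td></tr>",
  "</table>"
)

# Parse the string as HTML content rather than treating it as a file name
doc <- htmlParse(html_snippet, asText = TRUE)

# readHTMLTable() gives back a list with one entry per <table>
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Extracting the first entry directly avoids the "NULL." prefix
books <- tables[[1]]
str(books)
```

Indexing with [[1]] returns the data frame itself, so no renaming step should be needed afterwards.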

XML

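Next we load the XML file from GitHub. XML's structure differs slightly from HTML's, but the concept is the same: every element is opened and closed with a pair of tags, <tag> and </tag>. Unlike HTML, the tag names are not predefined, so we choose them ourselves (here Title, Author1, and so on).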

xmlgithub<-getURLContent("https://raw.githubusercontent.com/Luz917/books.xml/master/book.xml")
cat(xmlgithub)
## <fiction>
##      <book id = "Rowling">
##                <Title>Harry Potter and the Soccerer's Stone</Title>
##                <Author1>J.K Rowling</Author1>
##                <Author2>Mary GrandPre(Illustrator)</Author2>
##                <Pages>322</Pages>
##                <Publisher>Scholastic Inc</Publisher>
##                <ISBN>0439554934</ISBN>
##       </book>
##       <book id = "Garcia">
##               <Title>Beautiful Creatures</Title>
##               <Author1>Kami Garcia</Author1>
##               <Author2>Margaret Stohl</Author2>
##               <Pages>563</Pages>
##               <Publisher>Little, Brown and Company</Publisher>
##               <ISBN>0316042676</ISBN>
##        </book>
##        <book id= "Collins">
##               <Title>The Hunger Games</Title>
##               <Author1>Suzanne Collins</Author1>
##               <Author2>None</Author2>
##               <Pages>374</Pages>
##               <Publisher>Scholastic Press</Publisher>
##               <ISBN>0439023483</ISBN>
##         </book>
## </fiction>

For XML we first have to parse the data; after that we can convert it into a data frame. With XML we don't have to change the column names as we did with HTML, because the element names become the column names. Again we style the table with kableExtra.

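# Parse the XML text into a document object
books_xml <- xmlParse(xmlgithub)
# Each <book> becomes a row; its child element names become the column names
books_xmldf <- xmlToDataFrame(books_xml)
kable(books_xmldf) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "right")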
Title | Author1 | Author2 | Pages | Publisher | ISBN
Harry Potter and the Soccerer’s Stone | J.K Rowling | Mary GrandPre(Illustrator) | 322 | Scholastic Inc | 0439554934
Beautiful Creatures | Kami Garcia | Margaret Stohl | 563 | Little, Brown and Company | 0316042676
The Hunger Games | Suzanne Collins | None | 374 | Scholastic Press | 0439023483
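
One detail worth noting: xmlToDataFrame builds its columns from child elements only, so the id attribute on each book does not appear in the table. Below is a minimal sketch of one way to recover it, assuming a shortened inline copy of the XML (xml_snippet and the BookId column name are illustrative, not part of the real file).

```r
library(XML)

# Hypothetical, shortened stand-in for book.xml
xml_snippet <- '<fiction>
  <book id="Garcia"><Title>Beautiful Creatures</Title><Pages>563</Pages></book>
  <book id="Collins"><Title>The Hunger Games</Title><Pages>374</Pages></book>
</fiction>'

# Parse the string as XML content rather than as a file name
doc <- xmlParse(xml_snippet, asText = TRUE)

# Child elements of each <book> become columns
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Pull the id attribute from every <book> node with an XPath query
ids <- xpathSApply(doc, "//book", xmlGetAttr, "id")

# Attach it as an extra column (BookId is a made-up name)
books$BookId <- ids
books
```

Whether the attribute belongs in the table is a design choice; here it only repeats the first author's surname.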

JSON

Next we load the JSON file from GitHub and read in the document. JSON's structure is completely different from XML and HTML: data is written as key-value pairs rather than paired tags, so it is less repetitive than the other formats and a little easier to write.

jsongithub<-getURLContent("https://raw.githubusercontent.com/Luz917/-books.json/master/books.json")
cat(jsongithub)
## [
##   {
##     "Title": "Harry Potter and the Soccerer's Stone",
##     "Author1": "J.K Rowling",
##     "Author2": "Mary GrandPre(Illustrator)",
##     "Pages": 322,
##     "Publisher": "Scholastic Inc",
##     "ISBN": "0439554934" },
##   {
##     "Title": "Beautiful Creatures",
##     "Author1": "Kami Garcia",
##     "Author2": "Margaret Stohl",
##     "Pages": 563,
##     "Publisher": "Little, Brown and Company",
##     "ISBN": "0316042676" },
##   {
##     "Title": "The Hunger Games",
##     "Author1": "Suzanne Collins",
##     "Author2": "None",
##     "Pages": 374,
##     "Publisher": "Scholastic Press",
##      "ISBN": "0439023483" }
## ]

With JSON there is no separate parsing step to worry about: fromJSON alone converts the array of objects into a data frame. Again we style the table with kableExtra.

# fromJSON() turns the array of objects straight into a data frame
books_json <- fromJSON(jsongithub)
kable(books_json) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "right")
Title | Author1 | Author2 | Pages | Publisher | ISBN
Harry Potter and the Soccerer’s Stone | J.K Rowling | Mary GrandPre(Illustrator) | 322 | Scholastic Inc | 0439554934
Beautiful Creatures | Kami Garcia | Margaret Stohl | 563 | Little, Brown and Company | 0316042676
The Hunger Games | Suzanne Collins | None | 374 | Scholastic Press | 0439023483
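
Because JSON distinguishes numbers from strings, fromJSON carries those types into the data frame. Here is a small sketch of that behavior, assuming a two-book inline string (json_snippet is illustrative) rather than the GitHub file.

```r
library(jsonlite)

# Hypothetical, shortened stand-in for books.json
json_snippet <- '[
  {"Title": "Beautiful Creatures", "Pages": 563, "ISBN": "0316042676"},
  {"Title": "The Hunger Games", "Pages": 374, "ISBN": "0439023483"}
]'

books <- fromJSON(json_snippet)

# Unquoted values such as Pages become numeric columns;
# quoted values such as ISBN stay character, so leading zeros survive
sapply(books, class)
```

Keeping the ISBN quoted in the file is what preserves its leading zero.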

Are the tables identical?

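Although the HTML, XML, and JSON files are written very differently, the three rendered tables show identical content. One caveat: the JSON file stores Pages as a number, whereas HTML and XML carry every value as text, so the column types of the data frames can differ even though the displayed values match.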
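
The comparison can also be made programmatically. The sketch below uses two small hand-built data frames (from_xml and from_json are illustrative stand-ins, not the objects created earlier) to show how identical() reacts to a difference in column type, and how converting the column removes that difference.

```r
# Text-based sources typically yield character columns
from_xml <- data.frame(
  Title = c("Beautiful Creatures", "The Hunger Games"),
  Pages = c("563", "374"),
  stringsAsFactors = FALSE
)

# JSON numbers arrive as numeric values
from_json <- data.frame(
  Title = c("Beautiful Creatures", "The Hunger Games"),
  Pages = c(563L, 374L),
  stringsAsFactors = FALSE
)

# Same displayed values, but the Pages columns differ in type
same_before <- identical(from_xml, from_json)

# Convert the text column to integer and compare again
from_xml$Pages <- as.integer(from_xml$Pages)
same_after <- identical(from_xml, from_json)

c(before = same_before, after = same_after)
```

Running str() on each data frame is a quick way to see which columns would need converting.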

References:

kableExtra https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html

HTML Intro https://www.w3schools.com/html/html_intro.asp

JSON Intro https://www.w3schools.com/js/js_json_intro.asp

XML Intro https://www.w3schools.com/xml/