Week 7 Assignment Data 607

Assignment – Working with XML and JSON in R

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Load the required packages

require(XML)

## Loading required package: XML

require(rvest)

## Loading required package: rvest

## Loading required package: xml2

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:XML':
## 
##     xml

require(kableExtra)

## Loading required package: kableExtra

require(RCurl)

## Loading required package: RCurl

## Loading required package: bitops

require(methods)
require(xml2)
require(jsonlite)

## Loading required package: jsonlite
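Note that require() only attaches a package that is already installed; it does not download anything. The following is a minimal sketch, assuming the package names used above, that reports which of them would still need install.packages(). The variable names needed, installed, and missing_pkgs are illustrative and not part of the original code.

```r
# Packages used in this assignment.
needed <- c("XML", "rvest", "kableExtra", "RCurl", "xml2", "jsonlite")

# requireNamespace() returns TRUE or FALSE without attaching the package.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that are not installed yet (possibly none).
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
}
```

The sketch only prints a message; installing is left as a deliberate manual step.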

HTML

The first file to load is the HTML table. The HTML file was uploaded to GitHub beforehand, so we read its raw contents from there.

htmlgithub<-getURL("https://raw.githubusercontent.com/Luz917/books.html/master/book.html")
cat(htmlgithub)

## <table>
## <tr>
##   <th>Title</th>
##   <th>Author 1</th>
##   <th>Author 2</th>
##   <th>Pages</th>
##   <th>Publisher</th>
##   <th>ISBN</th>
## </tr>
## <body>
##  <tr>
##       <td>Harry Potter and the Soccerer's Stone</td>
##       <td> J.K Rowling</td>
##       <td>Mary GrandPre(Illustrator)
##       <td>322</td>
##       <td>Scholastic Inc</td>
##       <td>0439554934</td>
##  </tr>
##  <tr>
##       <td>Beautiful Creatures</td>
##       <td>Kami Garcia</td>
##       <td>Margaret Stohl</td>
##       <td>563</td>
##       <td>Little, Brown and Company</td>
##       <td>0316042676</td>
##  </tr>
##  <tr>
##       <td>The Hunger Games</td>
##       <td>Suzanne Collins</td>
##       <td>None</td>
##       <td>374</td>
##       <td>Scholastic Press</td>
##       <td>0439023483</td>
##   </tr>
## </body>
## </table>

Here we read in the HTML and convert it to a data frame. readHTMLTable returns a list of tables, and wrapping that unnamed list in data.frame prefixes every column title with NULL., so we reset the column names to remove it. Finally we use kableExtra to style the table.

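# readHTMLTable() returns a list with one data frame per table in the page
books_html<-readHTMLTable(htmlgithub)
# Flattening the unnamed list prefixes every column name with "NULL."
books_html<-data.frame(books_html)
colnames(books_html)<-c("Title","Author1","Author2","Pages","Publisher","ISBN")
kable(books_html) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "right")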

Title	Author1	Author2	Pages	Publisher	ISBN
Harry Potter and the Soccerer’s Stone	J.K Rowling	Mary GrandPre(Illustrator)	322	Scholastic Inc	0439554934
Beautiful Creatures	Kami Garcia	Margaret Stohl	563	Little, Brown and Company	0316042676
The Hunger Games	Suzanne Collins	None	374	Scholastic Press	0439023483
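As an alternative to readHTMLTable, the rvest package loaded earlier can read the same kind of table. This is only a sketch under stated assumptions: it needs rvest 1.0.0 or later (for html_element and the convert argument), and it uses a shortened, hand-written copy of the table instead of the GitHub file. The names books_html_text, page, and books_rvest are illustrative.

```r
library(rvest)

# A shortened, hand-written copy of the table (two books, three columns).
books_html_text <- '<table>
  <tr><th>Title</th><th>Pages</th><th>ISBN</th></tr>
  <tr><td>Beautiful Creatures</td><td>563</td><td>0316042676</td></tr>
  <tr><td>The Hunger Games</td><td>374</td><td>0439023483</td></tr>
</table>'

# read_html() accepts literal markup as well as a path or URL.
page <- read_html(books_html_text)

# convert = FALSE keeps every cell as text, so the leading zero
# of each ISBN is not lost to automatic number conversion.
books_rvest <- html_table(html_element(page, "table"), convert = FALSE)
books_rvest <- as.data.frame(books_rvest)

str(books_rvest)
```

Because the first row consists of th cells, html_table uses it as the header, so no renaming of columns should be needed in this variant.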

XML

Next we load the XML file from GitHub. XML's structure is slightly different from HTML's, but the concept is the same: every element is opened with a <tag> and closed with a matching </tag>.

xmlgithub<-getURLContent("https://raw.githubusercontent.com/Luz917/books.xml/master/book.xml")
cat(xmlgithub)

## <fiction>
##      <book id = "Rowling">
##                <Title>Harry Potter and the Soccerer's Stone</Title>
##                <Author1>J.K Rowling</Author1>
##                <Author2>Mary GrandPre(Illustrator)</Author2>
##                <Pages>322</Pages>
##                <Publisher>Scholastic Inc</Publisher>
##                <ISBN>0439554934</ISBN>
##       </book>
##       <book id = "Garcia">
##               <Title>Beautiful Creatures</Title>
##               <Author1>Kami Garcia</Author1>
##               <Author2>Margaret Stohl</Author2>
##               <Pages>563</Pages>
##               <Publisher>Little, Brown and Company</Publisher>
##               <ISBN>0316042676</ISBN>
##        </book>
##        <book id= "Collins">
##               <Title>The Hunger Games</Title>
##               <Author1>Suzanne Collins</Author1>
##               <Author2>None</Author2>
##               <Pages>374</Pages>
##               <Publisher>Scholastic Press</Publisher>
##               <ISBN>0439023483</ISBN>
##         </book>
## </fiction>

For XML we first have to parse the data; after that we can convert it into a data frame. Unlike with HTML, we don't have to change the column names, because xmlToDataFrame takes them from the element names. Again we style the table with kableExtra.

# Parse the XML text into a document, then turn each <book> into one row
books_xml<-xmlParse(xmlgithub)
books_xmldf<-xmlToDataFrame(books_xml)
kable(books_xmldf) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "right")

Title	Author1	Author2	Pages	Publisher	ISBN
Harry Potter and the Soccerer’s Stone	J.K Rowling	Mary GrandPre(Illustrator)	322	Scholastic Inc	0439554934
Beautiful Creatures	Kami Garcia	Margaret Stohl	563	Little, Brown and Company	0316042676
The Hunger Games	Suzanne Collins	None	374	Scholastic Press	0439023483
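The parse-then-convert steps can be tried on a small inline document. This is a sketch that assumes a shortened, hand-written copy of the file (two books, three fields); the names books_xml_text, doc, and books_df are illustrative. It also shows that the id attribute of each book is not picked up by xmlToDataFrame and has to be fetched separately.

```r
library(XML)

# A shortened, hand-written copy of the XML file.
books_xml_text <- '<fiction>
  <book id="Garcia">
    <Title>Beautiful Creatures</Title>
    <Author1>Kami Garcia</Author1>
    <Pages>563</Pages>
  </book>
  <book id="Collins">
    <Title>The Hunger Games</Title>
    <Author1>Suzanne Collins</Author1>
    <Pages>374</Pages>
  </book>
</fiction>'

# asText = TRUE tells the parser the argument is XML content, not a file name.
doc <- xmlParse(books_xml_text, asText = TRUE)

# One row per <book>, one column per child element; values arrive as text.
books_df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# The id attribute is not a child element, so fetch it with XPath.
books_df$id <- xpathSApply(doc, "//book", xmlGetAttr, "id")

# Page counts stay text until they are converted explicitly.
books_df$Pages <- as.integer(books_df$Pages)

str(books_df)
```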

JSON

Next we load the JSON file from GitHub and read in the document. JSON's structure is completely different from XML and HTML, and it is less repetitive than the other formats, which makes it a little easier to write.

jsongithub<-getURLContent("https://raw.githubusercontent.com/Luz917/-books.json/master/books.json")
cat(jsongithub)

## [
##   {
##     "Title": "Harry Potter and the Soccerer's Stone",
##     "Author1": "J.K Rowling",
##     "Author2": "Mary GrandPre(Illustrator)",
##     "Pages": 322,
##     "Publisher": "Scholastic Inc",
##     "ISBN": "0439554934" },
##   {
##     "Title": "Beautiful Creatures",
##     "Author1": "Kami Garcia",
##     "Author2": "Margaret Stohl",
##     "Pages": 563,
##     "Publisher": "Little, Brown and Company",
##     "ISBN": "0316042676" },
##   {
##     "Title": "The Hunger Games",
##     "Author1": "Suzanne Collins",
##     "Author2": "None",
##     "Pages": 374,
##     "Publisher": "Scholastic Press",
##      "ISBN": "0439023483" }
## ]

With JSON there is no separate parsing or conversion step to worry about: fromJSON alone returns a data frame. Again we style the table with kableExtra.

# fromJSON() turns an array of objects directly into a data frame
books_json<-fromJSON(jsongithub)
kable(books_json) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "right")

Title	Author1	Author2	Pages	Publisher	ISBN
Harry Potter and the Soccerer’s Stone	J.K Rowling	Mary GrandPre(Illustrator)	322	Scholastic Inc	0439554934
Beautiful Creatures	Kami Garcia	Margaret Stohl	563	Little, Brown and Company	0316042676
The Hunger Games	Suzanne Collins	None	374	Scholastic Press	0439023483
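To see what fromJSON does with the value types, here is a small sketch on an inline string. It assumes a shortened, hand-written copy of the file; the names books_json_text and books_df are illustrative.

```r
library(jsonlite)

# A shortened, hand-written copy of the JSON file.
books_json_text <- '[
  {"Title": "Beautiful Creatures", "Pages": 563, "ISBN": "0316042676"},
  {"Title": "The Hunger Games",    "Pages": 374, "ISBN": "0439023483"}
]'

# An array of objects with the same keys is simplified to a data frame.
books_df <- fromJSON(books_json_text)

# Column classes follow the JSON value types: unquoted whole numbers
# become integer, quoted values stay character.
sapply(books_df, class)
```

Keeping the ISBN in quotes in the JSON file is what preserves its leading zero.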

Are the tables identical?

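The three rendered tables are identical, even though the HTML, XML, and JSON files are written with different syntax. The underlying data frames can still differ in column type: the JSON file stores Pages as a number, while the values read from HTML and XML arrive as text.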
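One way to check this programmatically is identical(), which compares values and column types. The sketch below uses small hypothetical stand-ins (from_markup and from_json) rather than the data frames loaded above, so the result shown in the comments applies to these stand-ins only.

```r
# Hypothetical stand-in for a result read from HTML or XML: all text columns.
from_markup <- data.frame(Title = c("Beautiful Creatures", "The Hunger Games"),
                          Pages = c("563", "374"),
                          stringsAsFactors = FALSE)

# Hypothetical stand-in for a result read from JSON: Pages is an integer.
from_json <- data.frame(Title = c("Beautiful Creatures", "The Hunger Games"),
                        Pages = c(563L, 374L),
                        stringsAsFactors = FALSE)

# Same printed values, but different column types.
same_before <- identical(from_markup, from_json)   # FALSE

# Bring Pages to a common type, then compare again.
from_markup$Pages <- as.integer(from_markup$Pages)
same_after <- identical(from_markup, from_json)    # TRUE
```

The same comparison could be applied pairwise to books_html, books_xmldf, and books_json after aligning their column types.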

References:

kableExtra https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html

HTML Intro https://www.w3schools.com/html/html_intro.asp

JSON Intro https://www.w3schools.com/js/js_json_intro.asp

XML Intro https://www.w3schools.com/xml/