In this exercise, we will attempt to pull identical data on three books out of three different file formats: JSON, HTML, and XML. We will create these file in R so we can see how each file is constructed. We will want to compile the data on each of the books into a dataframe.

suppressWarnings(suppressMessages(library(rjson)))
suppressWarnings(suppressMessages(library(XML)))

This is the data we will want to encode:

title author cuisine yearPub ISBN
Plenty Yolam Ottolenghi and Jonathan Lovekin Vegetarian 2011 1452101248
Mastering the Art of French Cooking Julia Child French 1961 375413405
Balaboosta Einat Admony Israeli 2013 1579655009

JSON

Here we will write our JSON data into a file in and see how R returns the data from the file.

exportJSON <- "{\"title\":[\"Plenty\",\"Mastering the Art of French Cooking\",\"Balaboosta\"],\"author\":[\"Yolam Ottolenghi and Jonathan Lovekin\",\"Julia Child\",\"Einat Admony\"],\"cuisine\":[\"Vegetarian\",\"French\",\"Israeli\"],\"yearPub\":[2011,1961,2013],\"ISBN\":[1452101248,375413405,1579655009]}"
                
write(exportJSON, "cookBooks.json") # create JSON file
json_data <- fromJSON(file="cookBooks.json") # now we read directly from the file
json_data #view
## $title
## [1] "Plenty"                             
## [2] "Mastering the Art of French Cooking"
## [3] "Balaboosta"                         
## 
## $author
## [1] "Yolam Ottolenghi and Jonathan Lovekin"
## [2] "Julia Child"                          
## [3] "Einat Admony"                         
## 
## $cuisine
## [1] "Vegetarian" "French"     "Israeli"   
## 
## $yearPub
## [1] 2011 1961 2013
## 
## $ISBN
## [1] 1452101248  375413405 1579655009

Now we will compile this into a dataframe:

jsonDF <- data.frame(json_data$title,json_data$author,json_data$cuisine,json_data$yearPub,json_data$ISBN)
for (i in 1:length(jsonDF)){colnames(jsonDF)[i] <- names(json_data)[i]}

HTML

First we will write our HTML data into a file and see how R returns the data:

exportHTML <- "
<table>
  <tr>
  <th>title</th>
  <th>author</th>
  <th>cuisine</th>
  <th>yearPub</th>
  <th>ISBN</th>
  </tr>
  <tr>
  <td>Plenty</td>
  <td>Yolam Ottolenghi and Jonathan Lovekin</td>
  <td>Vegetarian</td>
  <td>2011</td>
  <td>1452101248</td>
  </tr>
  <tr>
  <td>Mastering the Art of French Cooking</td>
  <td>Julia Child</td>
  <td>French</td>
  <td>1961</td>
  <td>375413405</td>
  </tr>
  <tr>
  <td>Balaboost</td>
  <td>Einat Admony</td>
  <td>Israeli</td>
  <td>2013</td>
  <td>1579655009</td>
  </tr>
  </table>"
write(exportHTML, "cookBooks.html") # create HTML file
htmlDF <- readHTMLTable("cookBooks.html")
class(htmlDF) #returns as a list
## [1] "list"
htmlDF
## $`NULL`
##                                 title
## 1                              Plenty
## 2 Mastering the Art of French Cooking
## 3                           Balaboost
##                                  author    cuisine yearPub       ISBN
## 1 Yolam Ottolenghi and Jonathan Lovekin Vegetarian    2011 1452101248
## 2                           Julia Child     French    1961  375413405
## 3                          Einat Admony    Israeli    2013 1579655009
htmlDF <- htmlDF$'NULL' # Now we have our data frame

XML

We will use the same tactic with XML as with the previous examples. Here, you will see the data is structured much differently. When we return the data, we will use the XML package to identify nodes and return specific data by its identifying tag. We will then manually construct the dataframe by author, title, cuisine type, publication year, and ISBN number.

exportXML <- "
<myLibrary>
  <book>
      <title>Plenty</title>
      <author>Yolam Ottolenghi</author>
      <author>Jonathan Lovekin</author>
      <cuisine>Vegetarian</cuisine>
      <yearPub>2011</yearPub> 
      <ISBN>1452101248</ISBN>
  </book>
  <book>
      <title>Mastering the Art of French Cooking</title>
      <author>Julia Child</author>
      <cuisine>French</cuisine>
      <yearPub>1961</yearPub> 
      <ISBN>375413405</ISBN>
  </book>
  <book>
      <title>Balaboosta</title>
      <author>Einat Admony</author>
      <cuisine>Israeli</cuisine> 
      <yearPub>2013</yearPub>
      <ISBN>1579655009</ISBN>
  </book>
</myLibrary>"
  
write(exportXML, "cookBooks.xml") # create XML file

xmlData <- xmlTreeParse("cookBooks.xml",useInternal=T)
ourRoot <- xmlRoot(xmlData)
xmlName(ourRoot) # identify parent node
## [1] "myLibrary"
names(ourRoot) # find its child nodes - this shows there are three books in myLibrary
##   book   book   book 
## "book" "book" "book"
# like a list, calling top[1] will return the first book
ourRoot[1]
## $book
## <book>
##   <title>Plenty</title>
##   <author>Yolam Ottolenghi</author>
##   <author>Jonathan Lovekin</author>
##   <cuisine>Vegetarian</cuisine>
##   <yearPub>2011</yearPub>
##   <ISBN>1452101248</ISBN>
## </book> 
## 
## attr(,"class")
## [1] "XMLInternalNodeList" "XMLNodeList"
# we could identify the child nodes for each books by calling names(top[[1])
names(ourRoot[[1]])
##     title    author    author   cuisine   yearPub      ISBN 
##   "title"  "author"  "author" "cuisine" "yearPub"    "ISBN"
# We can built this dataframe by pulling the values one at a time:
title <- xpathApply(xmlData, "//title", xmlValue)
title <- as.character(title)
author <- xpathApply(xmlData, "//author", xmlValue)
author <- as.character(author)
cuisine <- xpathApply(xmlData, "//cuisine", xmlValue)
cuisine <- as.character(cuisine)
yearPub <- xpathApply(xmlData, "//yearPub", xmlValue)
yearPub <- as.character(yearPub)
ISBN <- xpathApply(xmlData, "//ISBN", xmlValue)
ISBN <- as.character(ISBN)

Now we need to combine the two authors in the “author” vector for the book “Plenty”. Since each row must represent one book, we need to make sure the two authors occupy one “cell”.

author[1] <- paste(author[1],"and",author[2])
author <- author[-2]
xmlDF <- data.frame(title,author,cuisine,yearPub,ISBN) # compile dataframe

Now we can see if the dataframes are identical:

xmlDF
##                                 title
## 1                              Plenty
## 2 Mastering the Art of French Cooking
## 3                          Balaboosta
##                                  author    cuisine yearPub       ISBN
## 1 Yolam Ottolenghi and Jonathan Lovekin Vegetarian    2011 1452101248
## 2                           Julia Child     French    1961  375413405
## 3                          Einat Admony    Israeli    2013 1579655009
htmlDF
##                                 title
## 1                              Plenty
## 2 Mastering the Art of French Cooking
## 3                           Balaboost
##                                  author    cuisine yearPub       ISBN
## 1 Yolam Ottolenghi and Jonathan Lovekin Vegetarian    2011 1452101248
## 2                           Julia Child     French    1961  375413405
## 3                          Einat Admony    Israeli    2013 1579655009
jsonDF
##                                 title
## 1                              Plenty
## 2 Mastering the Art of French Cooking
## 3                          Balaboosta
##                                  author    cuisine yearPub       ISBN
## 1 Yolam Ottolenghi and Jonathan Lovekin Vegetarian    2011 1452101248
## 2                           Julia Child     French    1961  375413405
## 3                          Einat Admony    Israeli    2013 1579655009