In this exercise, we will attempt to pull identical data on three books out of three different file formats: JSON, HTML, and XML. We will create these file in R so we can see how each file is constructed. We will want to compile the data on each of the books into a dataframe.
suppressWarnings(suppressMessages(library(rjson)))
suppressWarnings(suppressMessages(library(XML)))
This is the data we will want to encode:
| title | author | cuisine | yearPub | ISBN |
|---|---|---|---|---|
| Plenty | Yolam Ottolenghi and Jonathan Lovekin | Vegetarian | 2011 | 1452101248 |
| Mastering the Art of French Cooking | Julia Child | French | 1961 | 375413405 |
| Balaboosta | Einat Admony | Israeli | 2013 | 1579655009 |
Here we will write our JSON data into a file in and see how R returns the data from the file.
exportJSON <- "{\"title\":[\"Plenty\",\"Mastering the Art of French Cooking\",\"Balaboosta\"],\"author\":[\"Yolam Ottolenghi and Jonathan Lovekin\",\"Julia Child\",\"Einat Admony\"],\"cuisine\":[\"Vegetarian\",\"French\",\"Israeli\"],\"yearPub\":[2011,1961,2013],\"ISBN\":[1452101248,375413405,1579655009]}"
write(exportJSON, "cookBooks.json") # create JSON file
json_data <- fromJSON(file="cookBooks.json") # now we read directly from the file
json_data #view
## $title
## [1] "Plenty"
## [2] "Mastering the Art of French Cooking"
## [3] "Balaboosta"
##
## $author
## [1] "Yolam Ottolenghi and Jonathan Lovekin"
## [2] "Julia Child"
## [3] "Einat Admony"
##
## $cuisine
## [1] "Vegetarian" "French" "Israeli"
##
## $yearPub
## [1] 2011 1961 2013
##
## $ISBN
## [1] 1452101248 375413405 1579655009
Now we will compile this into a dataframe:
jsonDF <- data.frame(json_data$title,json_data$author,json_data$cuisine,json_data$yearPub,json_data$ISBN)
for (i in 1:length(jsonDF)){colnames(jsonDF)[i] <- names(json_data)[i]}
First we will write our HTML data into a file and see how R returns the data:
exportHTML <- "
<table>
<tr>
<th>title</th>
<th>author</th>
<th>cuisine</th>
<th>yearPub</th>
<th>ISBN</th>
</tr>
<tr>
<td>Plenty</td>
<td>Yolam Ottolenghi and Jonathan Lovekin</td>
<td>Vegetarian</td>
<td>2011</td>
<td>1452101248</td>
</tr>
<tr>
<td>Mastering the Art of French Cooking</td>
<td>Julia Child</td>
<td>French</td>
<td>1961</td>
<td>375413405</td>
</tr>
<tr>
<td>Balaboost</td>
<td>Einat Admony</td>
<td>Israeli</td>
<td>2013</td>
<td>1579655009</td>
</tr>
</table>"
write(exportHTML, "cookBooks.html") # create HTML file
htmlDF <- readHTMLTable("cookBooks.html")
class(htmlDF) #returns as a list
## [1] "list"
htmlDF
## $`NULL`
## title
## 1 Plenty
## 2 Mastering the Art of French Cooking
## 3 Balaboost
## author cuisine yearPub ISBN
## 1 Yolam Ottolenghi and Jonathan Lovekin Vegetarian 2011 1452101248
## 2 Julia Child French 1961 375413405
## 3 Einat Admony Israeli 2013 1579655009
htmlDF <- htmlDF$'NULL' # Now we have our data frame
We will use the same tactic with XML as with the previous examples. Here, you will see the data is structured much differently. When we return the data, we will use the XML package to identify nodes and return specific data by its identifying tag. We will then manually construct the dataframe by author, title, cuisine type, publication year, and ISBN number.
exportXML <- "
<myLibrary>
<book>
<title>Plenty</title>
<author>Yolam Ottolenghi</author>
<author>Jonathan Lovekin</author>
<cuisine>Vegetarian</cuisine>
<yearPub>2011</yearPub>
<ISBN>1452101248</ISBN>
</book>
<book>
<title>Mastering the Art of French Cooking</title>
<author>Julia Child</author>
<cuisine>French</cuisine>
<yearPub>1961</yearPub>
<ISBN>375413405</ISBN>
</book>
<book>
<title>Balaboosta</title>
<author>Einat Admony</author>
<cuisine>Israeli</cuisine>
<yearPub>2013</yearPub>
<ISBN>1579655009</ISBN>
</book>
</myLibrary>"
write(exportXML, "cookBooks.xml") # create XML file
xmlData <- xmlTreeParse("cookBooks.xml",useInternal=T)
ourRoot <- xmlRoot(xmlData)
xmlName(ourRoot) # identify parent node
## [1] "myLibrary"
names(ourRoot) # find its child nodes - this shows there are three books in myLibrary
## book book book
## "book" "book" "book"
# like a list, calling top[1] will return the first book
ourRoot[1]
## $book
## <book>
## <title>Plenty</title>
## <author>Yolam Ottolenghi</author>
## <author>Jonathan Lovekin</author>
## <cuisine>Vegetarian</cuisine>
## <yearPub>2011</yearPub>
## <ISBN>1452101248</ISBN>
## </book>
##
## attr(,"class")
## [1] "XMLInternalNodeList" "XMLNodeList"
# we could identify the child nodes for each books by calling names(top[[1])
names(ourRoot[[1]])
## title author author cuisine yearPub ISBN
## "title" "author" "author" "cuisine" "yearPub" "ISBN"
# We can built this dataframe by pulling the values one at a time:
title <- xpathApply(xmlData, "//title", xmlValue)
title <- as.character(title)
author <- xpathApply(xmlData, "//author", xmlValue)
author <- as.character(author)
cuisine <- xpathApply(xmlData, "//cuisine", xmlValue)
cuisine <- as.character(cuisine)
yearPub <- xpathApply(xmlData, "//yearPub", xmlValue)
yearPub <- as.character(yearPub)
ISBN <- xpathApply(xmlData, "//ISBN", xmlValue)
ISBN <- as.character(ISBN)
Now we need to combine the two authors in the “author” vector for the book “Plenty”. Since each row must represent one book, we need to make sure the two authors occupy one “cell”.
author[1] <- paste(author[1],"and",author[2])
author <- author[-2]
xmlDF <- data.frame(title,author,cuisine,yearPub,ISBN) # compile dataframe
Now we can see if the dataframes are identical:
xmlDF
## title
## 1 Plenty
## 2 Mastering the Art of French Cooking
## 3 Balaboosta
## author cuisine yearPub ISBN
## 1 Yolam Ottolenghi and Jonathan Lovekin Vegetarian 2011 1452101248
## 2 Julia Child French 1961 375413405
## 3 Einat Admony Israeli 2013 1579655009
htmlDF
## title
## 1 Plenty
## 2 Mastering the Art of French Cooking
## 3 Balaboost
## author cuisine yearPub ISBN
## 1 Yolam Ottolenghi and Jonathan Lovekin Vegetarian 2011 1452101248
## 2 Julia Child French 1961 375413405
## 3 Einat Admony Israeli 2013 1579655009
jsonDF
## title
## 1 Plenty
## 2 Mastering the Art of French Cooking
## 3 Balaboosta
## author cuisine yearPub ISBN
## 1 Yolam Ottolenghi and Jonathan Lovekin Vegetarian 2011 1452101248
## 2 Julia Child French 1961 375413405
## 3 Einat Admony Israeli 2013 1579655009