Working with XML and JSON in R

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Parsing HTML

# load the library
library(XML)
library(jsonlite)
library(DT)
library(RCurl)
## Loading required package: bitops
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
url<- getURL("https://raw.githubusercontent.com/DaisyCai2019/Homework/master/book.html")
html <- readHTMLTable(url, header = TRUE)
html
## $`NULL`
##             Title                     Author(s)
## 1   War and Peace                   Leo Tolstoy
## 2 The Kite Runner               Khaled Hosseini
## 3      Good Omens Terry Pratchett & Neil Gaiman
##                              Genre Year        Country
## 1                 Historical novel 1869         Russia
## 2 Historical fiction/Drama/Classic 2003  United States
## 3                           Horror 1990 United Kingdom

Parsing XML

url2 = getURL("https://raw.githubusercontent.com/DaisyCai2019/Homework/master/Book3.xml")

#parse the bond file
bond <- xmlParse(url2)
#change it to list
list <- xmlToList(bond)
#change it to data fram
xml <- ldply(list, data.frame)

#slect the columns we need
xml<-select(xml,Title,Authors,Genre,Year,Country)
xml
##             Title                         Authors
## 1   War and Peace                     Leo Tolstoy
## 2 The Kite Runner                 Khaled Hosseini
## 3      Good Omens  Terry Pratchett & Neil Gaiman 
##                              Genre Year        Country
## 1                 Historical novel 1869         Russia
## 2 Historical fiction/Drama/Classic 2003  United States
## 3                           Horror 1990 United Kingdom

Parsing JSON

url3 <- getURL("https://raw.githubusercontent.com/DaisyCai2019/Homework/master/Book2.json")
json <- fromJSON(url3)
json
## $Book2
##             Title                       Authors
## 1   War and Peace                   Leo Tolstoy
## 2 The Kite Runner               Khaled Hosseini
## 3      Good Omens Terry Pratchett & Neil Gaiman
##                              Genre Year        Country
## 1                 Historical novel 1869         Russia
## 2 Historical fiction/Drama/Classic 2003  United States
## 3                           Horror 1990 United Kingdom

Conclusion

When I load three sources into R, HTML and JSON format is easier than XML. I can copy the url, and then use readHTMLTable() and fromJSON () to load the data into R. For the XML, I need to parse the file, change it to a list and then to a frame. Although three tables are very similar, they are not the same. The data frame from HTML doesn’t include index number, but JSON format automatically has index numbers when I load the data into R. In addition, when I create XML file, I can’t use “&”directly for second authors and have to add amp;. The method I use to load the XML data also create some unnecessary columns like .id and .attrs, so I have to manually select those columns I want.