Assignment 5: XML and JSON in R

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

Load libraries

library(tidyverse)
library(RCurl)     # getURL() for downloading file contents
library(XML)       # readHTMLTable() and xmlToDataFrame()
library(jsonlite)  # fromJSON()

HTML file

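# Base URL shared by the three source files; only the extension differs
data_path <- "https://raw.githubusercontent.com/Naik-Khyati/json_xml/main/data/books."

# Download the HTML source as text, then parse the first table in it
html_url <- paste0(data_path, "html")
html_file <- getURL(html_url)
html_table <- readHTMLTable(html_file, which = 1)
head(html_table)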

##        isbn            book_title                        book_author pub_year
## 1 195153448   Classical Mythology                       Mark Morford     2002
## 2   2005018          Clara Callan Richard Bruce Wright; Rich Shapero     2001
## 3  60973129  Decision in Normandy                       Carlo D'Este     1991
## 4 399135782 The Kitchen Gods Wife                            Amy Tan     1991
## 5  61076031       Switching Goals            Mary-Kate; Ashley Olsen     2000
##              pub_comp
## 1        Zertex Media
## 2      Wye Publishing
## 3 Stirling Publishing
## 4    Packt Publishing
## 5  Aurora Metro books

XML file

# Download the XML source as text, then convert its records to a data frame
xml_url <- paste0(data_path, "xml")
xml_file <- getURL(xml_url)
xml_table <- xmlToDataFrame(xml_file)
head(xml_table)

##        isbn            book_title                        book_author pub_year
## 1 195153448   Classical Mythology                       Mark Morford     2002
## 2   2005018          Clara Callan Richard Bruce Wright; Rich Shapero     2001
## 3  60973129  Decision in Normandy                       Carlo D'Este     1991
## 4 399135782 The Kitchen Gods Wife                            Amy Tan     1991
## 5  61076031       Switching Goals            Mary-Kate; Ashley Olsen     2000
##              pub_comp
## 1        Zertex Media
## 2      Wye Publishing
## 3 Stirling Publishing
## 4    Packt Publishing
## 5  Aurora Metro books
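
The printed output does not show column types. If the HTML and XML readers return every column as text, which is an assumption here, base R's type.convert can turn numeric-looking columns into numbers. A minimal sketch on a hand-built data frame (xml_like is a hypothetical stand-in for xml_table, not the real data):

```r
# Hypothetical stand-in for a data frame whose columns were all read as text
xml_like <- data.frame(
  isbn       = c("195153448", "2005018"),
  book_title = c("Classical Mythology", "Clara Callan"),
  pub_year   = c("2002", "2001"),
  stringsAsFactors = FALSE
)

# as.is = TRUE keeps non-numeric text as character instead of factor
converted <- type.convert(xml_like, as.is = TRUE)
str(converted)
```

Whether isbn should become a number at all is a design choice: identifiers are often safer kept as text.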

JSON file

# fromJSON() accepts the URL directly and returns a list
json_url <- paste0(data_path, "json")
json_file <- fromJSON(json_url)

# Convert the list to a data frame
json_table <- as.data.frame(json_file)
head(json_table)

##   book.table.books.isbn book.table.books.book_title
## 1             195153448         Classical Mythology
## 2               2005018                Clara Callan
## 3              60973129        Decision in Normandy
## 4             399135782       The Kitchen Gods Wife
## 5              61076031             Switching Goals
##         book.table.books.book_author book.table.books.pub_year
## 1                       Mark Morford                      2002
## 2 Richard Bruce Wright; Rich Shapero                      2001
## 3                       Carlo D'Este                      1991
## 4                            Amy Tan                      1991
## 5            Mary-Kate; Ashley Olsen                      2000
##   book.table.books.pub_comp
## 1              Zertex Media
## 2            Wye Publishing
## 3       Stirling Publishing
## 4          Packt Publishing
## 5        Aurora Metro books
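
The JSON column names carry a book.table.books. prefix that the HTML and XML data frames lack. One way to align the names is to strip that prefix with sub(). A minimal sketch on a hand-built data frame (json_like is a hypothetical stand-in for json_table):

```r
# Hypothetical stand-in with the same prefixed names as the output above
json_like <- data.frame(
  book.table.books.isbn       = c("195153448", "2005018"),
  book.table.books.book_title = c("Classical Mythology", "Clara Callan"),
  stringsAsFactors = FALSE
)

# Remove the literal prefix; the dots are escaped so they match only "."
names(json_like) <- sub("^book\\.table\\.books\\.", "", names(json_like))
names(json_like)
```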

Compare HTML and XML tables

html_table==xml_table

##      isbn book_title book_author pub_year pub_comp
## [1,] TRUE       TRUE        TRUE     TRUE     TRUE
## [2,] TRUE       TRUE        TRUE     TRUE     TRUE
## [3,] TRUE       TRUE        TRUE     TRUE     TRUE
## [4,] TRUE       TRUE        TRUE     TRUE     TRUE
## [5,] TRUE       TRUE        TRUE     TRUE     TRUE

Compare HTML and JSON tables

html_table==json_table

##      isbn book_title book_author pub_year pub_comp
## [1,] TRUE       TRUE        TRUE     TRUE     TRUE
## [2,] TRUE       TRUE        TRUE     TRUE     TRUE
## [3,] TRUE       TRUE        TRUE     TRUE     TRUE
## [4,] TRUE       TRUE        TRUE     TRUE     TRUE
## [5,] TRUE       TRUE        TRUE     TRUE     TRUE

Compare JSON and XML tables

xml_table==json_table

##      isbn book_title book_author pub_year pub_comp
## [1,] TRUE       TRUE        TRUE     TRUE     TRUE
## [2,] TRUE       TRUE        TRUE     TRUE     TRUE
## [3,] TRUE       TRUE        TRUE     TRUE     TRUE
## [4,] TRUE       TRUE        TRUE     TRUE     TRUE
## [5,] TRUE       TRUE        TRUE     TRUE     TRUE
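
The == comparisons above test values position by position after type coercion, and the result takes its column names from the left-hand data frame, so they cannot reveal differences in names or types. A stricter check can be sketched with identical() and all.equal(). The data frames a and b below are hypothetical stand-ins, and the integer columns in b are an assumption about how JSON numbers might be read:

```r
# Hypothetical stand-ins: a holds text columns, b holds integers
# under prefixed column names
a <- data.frame(
  isbn     = c("195153448", "2005018"),
  pub_year = c("2002", "2001"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  book.table.books.isbn     = c(195153448L, 2005018L),
  book.table.books.pub_year = c(2002L, 2001L)
)

# Element-wise comparison coerces the integers to text before comparing
values_match <- all(a == b)

# identical() also checks names and column types
strictly_identical <- identical(a, b)

# all.equal() describes the differences instead of returning only FALSE
differences <- all.equal(a, b)

values_match
strictly_identical
differences
```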

Conclusion

To download the HTML and XML files from their URLs, we used getURL. To read the contents into data frames, we used readHTMLTable for the HTML table and xmlToDataFrame for the XML file.

To read the JSON file, we used fromJSON, which accepts the URL directly. The result is initially a list, so we used as.data.frame to convert it to a data frame.
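
Converting a nested list with as.data.frame is the likely source of the prefixed JSON column names. A minimal sketch, assuming the parsed JSON is a list that wraps the table in named elements; the names catalog and books are illustrative and not taken from books.json:

```r
# Hypothetical nested list shaped like parsed JSON
nested <- list(
  catalog = list(
    books = data.frame(
      isbn       = c("195153448", "2005018"),
      book_title = c("Classical Mythology", "Clara Callan"),
      stringsAsFactors = FALSE
    )
  )
)

# Flattening the whole list builds column names from the enclosing elements
flat <- as.data.frame(nested)
names(flat)

# Extracting the inner element keeps the original column names
inner <- nested$catalog$books
names(inner)
```

Extracting the inner data frame avoids renaming columns afterwards, at the cost of hard-coding the path into the list.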

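After reading the three files, we conclude that the values in the three data frames match: every element-wise == comparison returns TRUE. The data frames are not strictly identical, however. The JSON data frame has prefixed column names (for example, book.table.books.isbn rather than isbn), and == compares values by position after type coercion, so it checks neither column names nor column types.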