Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.
Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
library(tidyverse)
library(RCurl)
library(XML)
library(jsonlite)
data_path <- "https://raw.githubusercontent.com/Naik-Khyati/json_xml/main/data/books."
html_url <- paste0(data_path,"html")
html_file <- getURL(html_url)
html_table <- readHTMLTable(html_file, which=1)
head(html_table)
##        isbn            book_title                        book_author pub_year
## 1 195153448   Classical Mythology                       Mark Morford     2002
## 2   2005018          Clara Callan Richard Bruce Wright; Rich Shapero     2001
## 3  60973129  Decision in Normandy                       Carlo D'Este     1991
## 4 399135782 The Kitchen Gods Wife                            Amy Tan     1991
## 5  61076031       Switching Goals            Mary-Kate; Ashley Olsen     2000
##              pub_comp
## 1        Zertex Media
## 2      Wye Publishing
## 3 Stirling Publishing
## 4    Packt Publishing
## 5  Aurora Metro books
xml_url <- paste0(data_path,"xml")
xml_file <- getURL(xml_url)
xml_table <- xmlToDataFrame(xml_file)
head(xml_table)
##        isbn            book_title                        book_author pub_year
## 1 195153448   Classical Mythology                       Mark Morford     2002
## 2   2005018          Clara Callan Richard Bruce Wright; Rich Shapero     2001
## 3  60973129  Decision in Normandy                       Carlo D'Este     1991
## 4 399135782 The Kitchen Gods Wife                            Amy Tan     1991
## 5  61076031       Switching Goals            Mary-Kate; Ashley Olsen     2000
##              pub_comp
## 1        Zertex Media
## 2      Wye Publishing
## 3 Stirling Publishing
## 4    Packt Publishing
## 5  Aurora Metro books
json_url <- paste0(data_path,"json")
json_file <- fromJSON(json_url)
json_table <- as.data.frame(json_file)
head(json_table)
##   book.table.books.isbn book.table.books.book_title
## 1             195153448         Classical Mythology
## 2               2005018                Clara Callan
## 3              60973129        Decision in Normandy
## 4             399135782       The Kitchen Gods Wife
## 5              61076031             Switching Goals
##         book.table.books.book_author book.table.books.pub_year
## 1                       Mark Morford                      2002
## 2 Richard Bruce Wright; Rich Shapero                      2001
## 3                       Carlo D'Este                      1991
## 4                            Amy Tan                      1991
## 5            Mary-Kate; Ashley Olsen                      2000
##   book.table.books.pub_comp
## 1              Zertex Media
## 2            Wye Publishing
## 3       Stirling Publishing
## 4          Packt Publishing
## 5        Aurora Metro books
html_table==xml_table
## isbn book_title book_author pub_year pub_comp
## [1,] TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE TRUE TRUE
html_table==json_table
## isbn book_title book_author pub_year pub_comp
## [1,] TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE TRUE TRUE
xml_table==json_table
## isbn book_title book_author pub_year pub_comp
## [1,] TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE TRUE TRUE
To download the HTML and XML files from their URLs, we used the getURL() function from RCurl. To read the downloaded content into data frames, we used readHTMLTable() for the HTML table and xmlToDataFrame() for the XML file, both from the XML package.
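The two XML-package readers can also be tried without any network access. The following is a minimal sketch, assuming small inline strings (html_text and xml_text are hypothetical stand-ins, not the real books.html and books.xml) and parsing them explicitly with asText = TRUE before conversion:

```r
library(XML)

# Hypothetical stand-ins for the downloaded file contents (two books, two fields).
html_text <- '<table>
<tr><th>isbn</th><th>book_title</th></tr>
<tr><td>195153448</td><td>Classical Mythology</td></tr>
<tr><td>2005018</td><td>Clara Callan</td></tr>
</table>'

xml_text <- '<books>
<book><isbn>195153448</isbn><book_title>Classical Mythology</book_title></book>
<book><isbn>2005018</isbn><book_title>Clara Callan</book_title></book>
</books>'

# Parse the HTML text, then convert the first <table> into a data frame.
html_doc <- htmlParse(html_text, asText = TRUE)
html_df  <- readHTMLTable(html_doc, which = 1, stringsAsFactors = FALSE)

# Parse the XML text; each child of the root node becomes one row.
xml_doc <- xmlParse(xml_text, asText = TRUE)
xml_df  <- xmlToDataFrame(xml_doc, stringsAsFactors = FALSE)

str(html_df)
str(xml_df)
```

Both readers return every field as text here, which is worth keeping in mind when comparing the result with data from a format that distinguishes numbers from strings.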
To read the JSON file, we used the fromJSON() function from jsonlite, which accepts the URL directly. The result is initially a nested list, so we used as.data.frame() to convert it to a data frame.
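The prefixed column names in the JSON output (book.table.books.isbn and so on) appear to come from converting the whole nested list at once. The sketch below illustrates this on an inline string; the key names "book table" and "books" are assumptions inferred from those prefixes, since the real books.json is not shown here.

```r
library(jsonlite)

# Hypothetical stand-in for books.json; the nesting and key names are assumed.
json_text <- '{
  "book table": {
    "books": [
      {"isbn": "195153448", "book_title": "Classical Mythology", "pub_year": "2002"},
      {"isbn": "2005018",   "book_title": "Clara Callan",        "pub_year": "2001"}
    ]
  }
}'

# fromJSON() returns a nested list; the array of objects is simplified to a data frame.
parsed <- fromJSON(json_text)

# Converting the whole list prefixes each column with the names of its parent keys.
prefixed <- as.data.frame(parsed)
names(prefixed)

# Extracting the innermost element keeps the original field names.
json_books <- parsed[["book table"]][["books"]]
names(json_books)
```

Extracting the inner data frame directly avoids a renaming step later, at the cost of hard-coding the key path.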
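After reading the three files, we conclude that the three data frames hold the same values: every element-wise comparison with == returns TRUE. They are not strictly identical objects, however, because the JSON data frame carries prefixed column names (for example, book.table.books.isbn instead of isbn), and == ignores column names when comparing two data frames of equal size.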
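The difference between element-wise equality and strict identity can be checked with identical(). This is a minimal sketch on two small stand-in data frames (df_a and df_b are hypothetical, not the real files); the type difference in pub_year is an assumption added for illustration, since the column types of the real data frames are not shown.

```r
# Two stand-in data frames with the same values but different
# column names and a different type for pub_year.
df_a <- data.frame(isbn = c("195153448", "2005018"),
                   pub_year = c("2002", "2001"),
                   stringsAsFactors = FALSE)
df_b <- data.frame(books.isbn = c("195153448", "2005018"),
                   books.pub_year = c(2002L, 2001L),
                   stringsAsFactors = FALSE)

# `==` compares cell by cell, coercing types and ignoring column names.
all_cells_equal <- all(df_a == df_b)

# identical() also compares names and column types.
strictly_identical <- identical(df_a, df_b)

# After aligning names and types, the two objects are identical.
names(df_b) <- names(df_a)
df_b$pub_year <- as.character(df_b$pub_year)
identical_after_alignment <- identical(df_a, df_b)

print(c(all_cells_equal, strictly_identical, identical_after_alignment))
```

For a less strict check that still reports name and type mismatches as messages, all.equal() is a common alternative.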