Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
# Loading of required libraries for this assignment
library(DT)
library(stringr)
library(XML)
library(RCurl)
library(jsonlite)
I have chosen four well-known books on R programming.
# Read the HTML file from GitHub
html_parsed <- getURLContent("https://raw.githubusercontent.com/petferns/607-week7/main/book.html")
# Create a data frame from the parsed HTML; readHTMLTable() returns a list of tables
html <- readHTMLTable(html_parsed, stringsAsFactors = FALSE)
# Keep the first table in the list
html <- html[[1]]
# View the html table data in a datatable
datatable(html)
# View the summary
summary(html)
##     title             authors           publisher             year
##  Length:4           Length:4           Length:4           Length:4
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##     pages
##  Length:4
##  Class :character
##  Mode  :character
# Read the XML file from GitHub
xml_parsed <- getURL("https://raw.githubusercontent.com/petferns/607-week7/main/book.xml")
# Parse the XML text and convert it to a data frame
xml_parsed <- xmlParse(xml_parsed)
xml <- xmlToDataFrame(xml_parsed, stringsAsFactors = FALSE)
#Data viewing
datatable(xml)
#Summary
summary(xml)
##     title             authors           publisher             year
##  Length:4           Length:4           Length:4           Length:4
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##     pages
##  Length:4
##  Class :character
##  Mode  :character
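Both the HTML and XML summaries report every column, including year and pages, as character. A minimal sketch of one way to recover numeric types follows; the data frame `books_chr` and its values are hypothetical stand-ins, not the contents of the actual files.

```r
# Hypothetical stand-in for a data frame returned by readHTMLTable() or
# xmlToDataFrame(): every column arrives as character.
books_chr <- data.frame(
  title = c("Book A", "Book B"),
  year  = c("2014", "2017"),
  pages = c("377", "560"),
  stringsAsFactors = FALSE
)

# type.convert() re-guesses the type of each column; as.is = TRUE keeps
# text columns as character instead of converting them to factors.
books_typed <- type.convert(books_chr, as.is = TRUE)

# Inspect the resulting column classes
sapply(books_typed, class)
```

After the conversion, year and pages can be summarised numerically in the same way as the columns read from JSON.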
# Read the JSON file stored on GitHub
JSON_parsed <- fromJSON("https://raw.githubusercontent.com/petferns/607-week7/main/book.json")
# Extract the first element of the parsed JSON and make sure it is a data frame
json <- JSON_parsed[[1]]
json <- as.data.frame(json)
#View json table data in a datatable
datatable(json)
#Summary
summary(json)
##     title           authors.Length  authors.Class  authors.Mode
##  Length:4           2               -none-         character
##  Class :character   4               -none-         character
##  Mode  :character   1               -none-         character
##                     2               -none-         character
##
##
##   publisher              year          pages
##  Length:4           Min.   :2014   Min.   :377.0
##  Class :character   1st Qu.:2014   1st Qu.:433.2
##  Mode  :character   Median :2014   Median :486.0
##                     Mean   :2015   Mean   :477.2
##                     3rd Qu.:2016   3rd Qu.:530.0
##                     Max.   :2017   Max.   :560.0
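The authors.Length values in the summary above suggest that each row of the authors column holds a vector of varying length. The sketch below reproduces that behaviour on a small inline JSON string; the key name `books` and all values are assumptions for illustration, since the structure of the real book.json is not shown here.

```r
library(jsonlite)

# Hypothetical JSON with the assumed shape: one top-level key holding an
# array of records, where "authors" is itself an array.
json_text <- '{
  "books": [
    {"title": "Book A", "authors": ["Author X", "Author Y"], "year": 2014},
    {"title": "Book B", "authors": ["Author Z"], "year": 2017}
  ]
}'

# fromJSON() simplifies the array of records into a data frame
books <- fromJSON(json_text)[[1]]

# Arrays of differing length cannot become an atomic column,
# so authors is kept as a list of character vectors.
class(books$authors)
lengths(books$authors)

# Numbers in JSON are parsed as numeric, unlike the HTML and XML readers.
class(books$year)
```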
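The datatable views above look similar for all three tables, but the summaries reveal differences: in the HTML and XML data frames every column is character, while in the JSON data frame year and pages are numeric and authors has a different structure.
The class of the “title” column matches in the JSON and XML data frames.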
class(xml$title)
## [1] "character"
class(json$title)
## [1] "character"
In the JSON file the authors are stored in an array, whereas in the HTML table they are plain text, so let's check whether the column classes differ.
class(html$authors)
## [1] "character"
class(json$authors)
## [1] "list"
The authors column is a list in the JSON data frame and a character vector in the HTML data frame, so the three data frames are not identical.
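One possible way to make the data frames comparable is to collapse the list column into a single string per row and align the remaining column types. The following is a sketch on hypothetical stand-in data (`html_like`, `json_like`), assuming the HTML table stores multiple authors as a comma-separated string.

```r
# Hypothetical stand-in for the HTML-derived data frame: all character
html_like <- data.frame(
  title   = c("Book A", "Book B"),
  year    = c("2014", "2017"),
  authors = c("Author X, Author Y", "Author Z"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the JSON-derived data frame:
# numeric year and a list column for authors
json_like <- data.frame(
  title = c("Book A", "Book B"),
  year  = c(2014L, 2017L),
  stringsAsFactors = FALSE
)
json_like$authors <- list(c("Author X", "Author Y"), "Author Z")

# The raw data frames differ in column types
identical(html_like, json_like)

# Collapse each author vector into one string and cast year to character
json_flat <- json_like
json_flat$authors <- vapply(json_flat$authors, paste, character(1),
                            collapse = ", ")
json_flat$year <- as.character(json_flat$year)

# Compare again after harmonising the types
all.equal(html_like, json_flat)
```

Collapsing loses the distinction between individual authors, so this suits comparison or display rather than further per-author analysis.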