DATA 607 Week 7 Homework

Create three files which store the book’s information in HTML, XML, and JSON formats (“books.html”, “books.xml”, and “books.json”). The attributes are Title, Authors, ISBN and Publisher.

Let’s load the R libraries required for the analysis:
library(dplyr)
library(downloader)
library(tidyverse)
library(RCurl)
library(rvest)
library(kableExtra)


Load the HTML into an R data frame

# Read the page, extract its table(s), and convert the result to a data frame
html_df <- read_html("https://raw.githubusercontent.com/baruab/msdsrepo/main/DATA-607/Books1.html", encoding = 'UTF-8') %>%
  html_table(header = NA, trim = TRUE) %>%
  as.data.frame()
  
kable(html_df)
| Title | Authors | ISBN | Publisher |
|:------|:--------|:-----|:----------|
| Digital Transformation | Thomas M. Siebel | 978-1-948122 | Rosetta Books |
| The Unix Programming Environment | Brian W. Kernighan,Rob Pike | 0-13-937681-X | Prentice Hall |
| Design Patterns | Eric Gamma,Richard Helm,Ralph Johnson,John Vlissides | 0-201-63361-2 | Addison-Wesley |


Load the XML into an R data frame

library(xml2)
library(XML)

xmlfile <- read_xml("https://raw.githubusercontent.com/baruab/msdsrepo/main/DATA-607/books.xml")

xml_df <- xmlParse(xmlfile) %>%              # parse the XML document downloaded above
  xmlRoot() %>%                               # get the root node of the XML data
  xmlToDataFrame(stringsAsFactors = FALSE)    # one row per child of the root, one column per field

kable(xml_df)
| Title | Authors | ISBN | Publisher |
|:------|:--------|:-----|:----------|
| Digital Transformation | Thomas M. Siebel | 978-1-948122 | Rosetta Books |
| The Unix Programming Environment | Brian W. KernighanRob Pike | 0-13-937681-X | Prentice Hall |
| Design Patterns | Eric GammaRichard HelmRalph JohnsonJohn Vlissides | 0-201-63361-2 | Addison-Wesley |
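
The run-together author names in the table above appear because the text of the nested author elements is concatenated with no separator. A minimal sketch of one way to keep the names apart with xml2 follows. The XML string and its element names (`Books`, `Book`, `Authors`, `Author`) are assumptions made for illustration and may not match the real books.xml.

```r
library(xml2)

# Hypothetical XML; the element names are assumed, not taken from books.xml.
xml_text_in <- '<Books>
  <Book>
    <Title>The Unix Programming Environment</Title>
    <Authors><Author>Brian W. Kernighan</Author><Author>Rob Pike</Author></Authors>
  </Book>
  <Book>
    <Title>Digital Transformation</Title>
    <Authors><Author>Thomas M. Siebel</Author></Authors>
  </Book>
</Books>'

doc   <- read_xml(xml_text_in)
books <- xml_find_all(doc, ".//Book")

# Taking the text of the parent element concatenates its children directly.
joined_raw <- xml_text(xml_find_first(books, "./Authors"))

# Collapsing the <Author> children per book gives an explicit separator.
authors <- vapply(
  books,
  function(b) paste(xml_text(xml_find_all(b, "./Authors/Author")), collapse = ", "),
  character(1)
)

xml_sketch_df <- data.frame(
  Title   = xml_text(xml_find_first(books, "./Title")),
  Authors = authors,
  stringsAsFactors = FALSE
)
```

Here `joined_raw` reproduces the concatenation seen in the table, while `xml_sketch_df$Authors` holds one comma-separated string per book.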


Load the JSON into an R data frame

library(jsonlite)

json_df <- fromJSON("https://raw.githubusercontent.com/baruab/msdsrepo/main/DATA-607/books.json") %>% as.data.frame

json_df <- setNames(json_df, c("Title", "Authors", "ISBN", "Publisher"))

kable(json_df)
| Title | Authors | ISBN | Publisher |
|:------|:--------|:-----|:----------|
| Digital Transformation | Thomas M. Siebel | 978-1-948122 | Rosetta Books |
| The Unix Programming Environment | Brian W. Kernighan, Rob Pike | 0-13-937681-X | Prentice Hall |
| Design Patterns | Eric Gamma , Richard Helm , Ralph Johnson , John Vlissides | 0-201-63361-2 | Addison-Wesley |
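
When authors are stored as a JSON array, jsonlite keeps them as a list column rather than a single string per row. The sketch below illustrates this on a small inline JSON string. The layout of that string is an assumption and may differ from the real books.json.

```r
library(jsonlite)

# Hypothetical JSON; the structure is assumed, not taken from books.json.
json_text_in <- '[
  {"Title": "The Unix Programming Environment",
   "Authors": ["Brian W. Kernighan", "Rob Pike"]},
  {"Title": "Digital Transformation",
   "Authors": ["Thomas M. Siebel"]}
]'

json_sketch_df <- fromJSON(json_text_in)

# Arrays of differing length cannot be simplified to a plain vector,
# so the Authors column is a list holding one character vector per book.
authors_is_list <- is.list(json_sketch_df$Authors)
authors_counts  <- lengths(json_sketch_df$Authors)
```

`authors_counts` gives the number of authors per book, which is information the HTML and XML data frames no longer carry explicitly.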


Compare the data frames

HTML vs XML

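The data frames built from the HTML file and the XML file are not identical. They differ only in the Authors column: in the HTML-derived data frame multiple authors are separated by commas, whereas in the XML-derived data frame the author names are concatenated with no separator. The two books with more than one author therefore produce the two string mismatches reported below.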

all.equal(html_df, xml_df)
## [1] "Component \"Authors\": 2 string mismatches"


HTML vs JSON

The data frames built from the HTML file and the JSON file are not identical either. Here the difference is structural: the Authors column is a character vector in the HTML-derived data frame but a list in the JSON-derived data frame.

all.equal(html_df, json_df)
## [1] "Component \"Authors\": Modes: character, list"              
## [2] "Component \"Authors\": target is character, current is list"
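
One possible way to make the two data frames comparable is to flatten the list column and standardise the separator before calling all.equal(). The sketch below uses small stand-in data frames rather than `html_df` and `json_df`. The helper `normalize_authors()` and the stand-ins are hypothetical names introduced for illustration.

```r
# Hypothetical stand-ins for html_df and json_df, reduced to two columns.
html_like <- data.frame(
  Title   = c("Digital Transformation", "The Unix Programming Environment"),
  Authors = c("Thomas M. Siebel", "Brian W. Kernighan,Rob Pike"),
  stringsAsFactors = FALSE
)
json_like <- data.frame(Title = html_like$Title, stringsAsFactors = FALSE)
json_like$Authors <- list("Thomas M. Siebel", c("Brian W. Kernighan", "Rob Pike"))

# Hypothetical helper: flatten a list column to one string per row,
# then rewrite every comma separator as ", ".
normalize_authors <- function(x) {
  if (is.list(x)) x <- vapply(x, paste, character(1), collapse = ",")
  gsub("\\s*,\\s*", ", ", trimws(x))
}

html_like$Authors <- normalize_authors(html_like$Authors)
json_like$Authors <- normalize_authors(json_like$Authors)

comparison <- all.equal(html_like, json_like)
```

After normalisation both Authors columns are character vectors with the same separator, so `comparison` no longer reports a mode mismatch. The same approach would not recover the XML-derived column, because its names were already concatenated without a separator.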


XML vs JSON

The data frames built from the XML file and the JSON file are not identical either. As in the HTML comparison, the difference is structural: the Authors column is a character vector in the XML-derived data frame but a list in the JSON-derived data frame.

all.equal(xml_df, json_df)
## [1] "Component \"Authors\": Modes: character, list"              
## [2] "Component \"Authors\": target is character, current is list"