DATA_607_Week7

DATA 607 Week7 Homework

Create three files that store book information in HTML, XML, and JSON formats (“books.html”, “books.xml”, and “books.json”). The attributes are Title, Authors, ISBN, and Publisher.
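
As a rough sketch of how such a file could be generated from R instead of being written by hand, the example below builds a two-book data frame and writes it out with jsonlite. The temporary path, the books object, and the choice to store Authors as an array are assumptions for illustration; the actual books.json may be structured differently.

```r
library(jsonlite)

# Illustrative data: two of the three books, with Authors kept as a list column
books <- data.frame(
  Title     = c("Digital Transformation", "The Unix Programming Environment"),
  ISBN      = c("978-1-948122", "0-13-937681-X"),
  Publisher = c("Rosetta Books", "Prentice Hall"),
  stringsAsFactors = FALSE
)
books$Authors <- list("Thomas M. Siebel", c("Brian W. Kernighan", "Rob Pike"))

# Write to a temporary location (hypothetical path, not the repository file)
json_path <- file.path(tempdir(), "books.json")
write_json(books, json_path, pretty = TRUE)

# Read the file back to confirm that the authors survive as separate values
round_trip <- fromJSON(json_path)
str(round_trip$Authors)
```

Keeping each author as a separate array element avoids having to split a delimited string later, at the cost of a list column when the file is read back.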

Load the R libraries required for the analysis.

library(dplyr)
library(downloader)
library(tidyverse)
library(RCurl)
library(rvest)
library(kableExtra)

Load the HTML into an R data frame

# read_html() parses the page; html_table() returns a list with one table per <table> element
html_df <- read_html("https://raw.githubusercontent.com/baruab/msdsrepo/main/DATA-607/Books1.html", encoding = "UTF-8") %>%
  html_table(header = NA, trim = TRUE) %>%
  as.data.frame()
  
kable(html_df)

Title	Authors	ISBN	Publisher
Digital Transformation	Thomas M. Siebel	978-1-948122	Rosetta Books
The Unix Programming Environment	Brian W. Kernighan,Rob Pike	0-13-937681-X	Prentice Hall
Design Patterns	Eric Gamma,Richard Helm,Ralph Johnson,John Vlissides	0-201-63361-2	Addison-Wesley
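
To see what html_table() returns without fetching the remote file, a minimal sketch is to parse an inline HTML string. The markup below is a hypothetical stand-in for Books1.html, and the names html_src, tables, and html_demo are made up for this example.

```r
library(rvest)

# Hypothetical stand-in for Books1.html: a header row plus one book
html_src <- '<table>
  <tr><th>Title</th><th>Authors</th><th>ISBN</th><th>Publisher</th></tr>
  <tr><td>The Unix Programming Environment</td><td>Brian W. Kernighan,Rob Pike</td><td>0-13-937681-X</td><td>Prentice Hall</td></tr>
</table>'

# html_table() on a whole document returns a list with one tibble per <table>
tables <- read_html(html_src) %>% html_table(header = NA, trim = TRUE)

# Take the first (and only) table and convert it to a plain data frame
html_demo <- as.data.frame(tables[[1]])
str(html_demo)
```

Because the authors sit in a single table cell, they arrive as one comma-separated string per book.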

Load the XML into an R data frame

library(xml2)
library(XML)

xmlfile <- read_xml("https://raw.githubusercontent.com/baruab/msdsrepo/main/DATA-607/books.xml")

xml_df <- xmlParse(xmlfile) %>%              # parse the XML content into an XML document
  xmlRoot() %>%                              # get the root node of the XML data
  xmlToDataFrame(stringsAsFactors = FALSE)   # one row per child of the root, one column per field

kable(xml_df)

Title	Authors	ISBN	Publisher
Digital Transformation	Thomas M. Siebel	978-1-948122	Rosetta Books
The Unix Programming Environment	Brian W. KernighanRob Pike	0-13-937681-X	Prentice Hall
Design Patterns	Eric GammaRichard HelmRalph JohnsonJohn Vlissides	0-201-63361-2	Addison-Wesley
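
The author names in the output above run together with no separator. A possible way to keep them apart, sketched here with xml2 only, is to read each author element individually and join the names explicitly. The inline XML is a hypothetical stand-in: that books.xml nests one Author element per name inside Authors is an assumption, as are the names xml_src, glued, joined, and xml_demo.

```r
library(xml2)

# Hypothetical stand-in for books.xml: authors stored as nested <Author> elements
xml_src <- '<books>
  <book>
    <Title>The Unix Programming Environment</Title>
    <Authors><Author>Brian W. Kernighan</Author><Author>Rob Pike</Author></Authors>
    <ISBN>0-13-937681-X</ISBN>
    <Publisher>Prentice Hall</Publisher>
  </book>
</books>'

doc   <- read_xml(xml_src)
books <- xml_find_all(doc, "//book")

# Taking the text of the whole <Authors> node glues the names together
glued <- xml_text(xml_find_first(books, "Authors"))

# Reading each <Author> separately lets us choose the separator
joined <- vapply(seq_along(books), function(i) {
  paste(xml_text(xml_find_all(books[[i]], "Authors/Author")), collapse = ",")
}, character(1))

xml_demo <- data.frame(
  Title     = xml_text(xml_find_first(books, "Title")),
  Authors   = joined,
  ISBN      = xml_text(xml_find_first(books, "ISBN")),
  Publisher = xml_text(xml_find_first(books, "Publisher")),
  stringsAsFactors = FALSE
)
print(xml_demo)
```

With a comma as the separator, the Authors value takes the same form as the one read from the HTML table.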

Load the JSON into an R data frame

library(jsonlite)

json_df <- fromJSON("https://raw.githubusercontent.com/baruab/msdsrepo/main/DATA-607/books.json") %>% as.data.frame()

# Rename the columns to match the HTML and XML data frames
json_df <- setNames(json_df, c("Title", "Authors", "ISBN", "Publisher"))

kable(json_df)

Title	Authors	ISBN	Publisher
Digital Transformation	Thomas M. Siebel	978-1-948122	Rosetta Books
The Unix Programming Environment	Brian W. Kernighan, Rob Pike	0-13-937681-X	Prentice Hall
Design Patterns	Eric Gamma , Richard Helm , Ralph Johnson , John Vlissides	0-201-63361-2	Addison-Wesley
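
When authors are stored as JSON arrays of differing lengths, fromJSON() returns them as a list column instead of a character column. The sketch below illustrates this with an inline JSON string and then flattens the list into one string per book. The JSON layout and the names json_src, json_demo, and authors_flat are assumptions for illustration, not the contents of books.json.

```r
library(jsonlite)

# Hypothetical stand-in for books.json: Authors stored as arrays of names
json_src <- '[
  {"Title": "Digital Transformation",
   "Authors": ["Thomas M. Siebel"],
   "ISBN": "978-1-948122", "Publisher": "Rosetta Books"},
  {"Title": "The Unix Programming Environment",
   "Authors": ["Brian W. Kernighan", "Rob Pike"],
   "ISBN": "0-13-937681-X", "Publisher": "Prentice Hall"}
]'

json_demo <- fromJSON(json_src)

# Arrays of different lengths cannot form a plain column, so Authors is a list
class(json_demo$Authors)

# Collapse each element into a single comma-separated string
authors_flat <- vapply(json_demo$Authors, paste, character(1), collapse = ",")
print(authors_flat)
```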

Compare the dataframes

HTML vs XML

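The data frames built from the HTML and XML files are not identical. The difference is in the Authors column: the HTML table keeps the author names in one comma-separated string, whereas the XML parse concatenates them with no separator (for example, “Brian W. KernighanRob Pike”), so the two rows with multiple authors do not match.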

all.equal(html_df, xml_df)

## [1] "Component \"Authors\": 2 string mismatches"

HTML vs JSON

The data frames built from the HTML and JSON files are not identical either. The Authors column is a character vector in the HTML data frame but a list in the JSON data frame, so the two columns differ in structure, not just in content.

all.equal(html_df, json_df)

## [1] "Component \"Authors\": Modes: character, list"              
## [2] "Component \"Authors\": target is character, current is list"

XML vs JSON

The data frames built from the XML and JSON files also differ, for the same structural reason: Authors is a character vector in the XML data frame and a list in the JSON data frame.

all.equal(xml_df, json_df)

## [1] "Component \"Authors\": Modes: character, list"              
## [2] "Component \"Authors\": target is character, current is list"
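
If the goal were to make the data frames comparable, one option would be to normalize the Authors column before calling all.equal(). The sketch below uses base R only and small made-up data frames that mimic the HTML and JSON results; html_small, json_small, and normalize_authors() are hypothetical names. The XML result is left out because its concatenated names cannot be split reliably after the fact.

```r
# Miniature stand-ins for html_df and json_df (illustrative data only)
titles <- c("Digital Transformation", "Design Patterns")

html_small <- data.frame(
  Title   = titles,
  Authors = c("Thomas M. Siebel",
              "Eric Gamma,Richard Helm,Ralph Johnson,John Vlissides"),
  stringsAsFactors = FALSE
)

json_small <- data.frame(Title = titles, stringsAsFactors = FALSE)
json_small$Authors <- list(
  "Thomas M. Siebel",
  c("Eric Gamma ", "Richard Helm ", "Ralph Johnson ", "John Vlissides")
)

# Before normalizing, all.equal() reports differences instead of TRUE
equal_before <- isTRUE(all.equal(html_small, json_small))

normalize_authors <- function(x) {
  # Split comma-separated strings so both inputs become lists of names
  if (is.character(x)) x <- strsplit(x, ",", fixed = TRUE)
  # Trim stray whitespace and rejoin with one consistent separator
  vapply(x, function(a) paste(trimws(a), collapse = ", "), character(1))
}

html_small$Authors <- normalize_authors(html_small$Authors)
json_small$Authors <- normalize_authors(json_small$Authors)

equal_after <- isTRUE(all.equal(html_small, json_small))
print(c(before = equal_before, after = equal_after))
```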

DATA_607_Week7_Homework

Bikram Barua

10/8/2021