Create three files that store the books' information in HTML, XML, and JSON formats (“books.html”, “books.xml”, and “books.json”). The attributes are Title, Authors, ISBN, and Publisher.
library(dplyr)
library(downloader)
library(tidyverse)
library(RCurl)
library(rvest)
library(kableExtra)
html_df <- as.data.frame(read_html("https://raw.githubusercontent.com/baruab/msdsrepo/main/DATA-607/Books1.html", encoding = 'UTF-8') %>% html_table(header = NA, trim = TRUE))
kable(html_df)
| Title | Authors | ISBN | Publisher |
|---|---|---|---|
| Digital Transformation | Thomas M. Siebel | 978-1-948122 | Rosetta Books |
| The Unix Programming Environment | Brian W. Kernighan,Rob Pike | 0-13-937681-X | Prentice Hall |
| Design Patterns | Eric Gamma,Richard Helm,Ralph Johnson,John Vlissides | 0-201-63361-2 | Addison-Wesley |
library(xml2)
library(XML)
xmlfile <- read_xml("https://raw.githubusercontent.com/baruab/msdsrepo/main/DATA-607/books.xml")
xml_df <- xmlParse(xmlfile) %>% #parse the XML document with the XML package
  xmlRoot() %>% #get the root node of the XML data
  xmlToDataFrame(stringsAsFactors = FALSE) #one row per child of the root node
kable(xml_df)
| Title | Authors | ISBN | Publisher |
|---|---|---|---|
| Digital Transformation | Thomas M. Siebel | 978-1-948122 | Rosetta Books |
| The Unix Programming Environment | Brian W. KernighanRob Pike | 0-13-937681-X | Prentice Hall |
| Design Patterns | Eric GammaRichard HelmRalph JohnsonJohn Vlissides | 0-201-63361-2 | Addison-Wesley |
library(jsonlite)
json_df <- fromJSON("https://raw.githubusercontent.com/baruab/msdsrepo/main/DATA-607/books.json") %>% as.data.frame
json_df <- setNames(json_df , c("Title","Authors","ISBN", "Publisher"))
kable(json_df)
| Title | Authors | ISBN | Publisher |
|---|---|---|---|
| Digital Transformation | Thomas M. Siebel | 978-1-948122 | Rosetta Books |
| The Unix Programming Environment | Brian W. Kernighan, Rob Pike | 0-13-937681-X | Prentice Hall |
| Design Patterns | Eric Gamma , Richard Helm , Ralph Johnson , John Vlissides | 0-201-63361-2 | Addison-Wesley |
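The two data frames converted from the HTML file and the XML file are not exactly the same. They differ in the Authors column: the HTML table keeps the author names separated by commas, while the XML conversion runs the names together with no separator, which affects the two books that have more than one author.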
all.equal(html_df, xml_df)
## [1] "Component \"Authors\": 2 string mismatches"
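One way to keep a separator between author names is to extract them node by node with xml2 instead of relying on `xmlToDataFrame`. The sketch below is illustrative only: the inline XML string and its element names (`book`, `authors`, `author`, `title`, `isbn`, `publisher`) are assumptions, since the actual layout of books.xml may differ, and `collapse_authors` and `xml_fixed` are hypothetical names.

```r
library(xml2)

# Hypothetical XML layout: one <author> child per name inside <authors>.
# The real books.xml may be structured differently.
xml_src <- paste0(
  "<books>",
  "<book><title>The Unix Programming Environment</title>",
  "<authors><author>Brian W. Kernighan</author><author>Rob Pike</author></authors>",
  "<isbn>0-13-937681-X</isbn><publisher>Prentice Hall</publisher></book>",
  "<book><title>Digital Transformation</title>",
  "<authors><author>Thomas M. Siebel</author></authors>",
  "<isbn>978-1-948122</isbn><publisher>Rosetta Books</publisher></book>",
  "</books>"
)

doc   <- read_xml(xml_src)
books <- xml_find_all(doc, "//book")

# Join the author names of a single <book> node with commas.
collapse_authors <- function(book) {
  paste(xml_text(xml_find_all(book, "./authors/author")), collapse = ",")
}

xml_fixed <- data.frame(
  Title     = xml_text(xml_find_first(books, "./title")),
  Authors   = vapply(seq_along(books),
                     function(i) collapse_authors(books[[i]]),
                     character(1)),
  ISBN      = xml_text(xml_find_first(books, "./isbn")),
  Publisher = xml_text(xml_find_first(books, "./publisher")),
  stringsAsFactors = FALSE
)
```

With the names joined by commas, the Authors column would have the same form as the one read from the HTML table, provided the XML really nests one element per author.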
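The two data frames converted from the HTML file and the JSON file are also not exactly the same. The Authors column is a character vector in the HTML data frame but a list in the JSON data frame.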
all.equal(html_df, json_df)
## [1] "Component \"Authors\": Modes: character, list"
## [2] "Component \"Authors\": target is character, current is list"
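The type mismatch can be removed by flattening the list column into a character vector before comparing. The following is a minimal sketch in base R, using small hand-built stand-ins (`html_like`, `json_like`) rather than the real data frames; the helper `flatten_authors` is a hypothetical name, and the choice of a comma with no space as the separator is an assumption made to match the HTML output shown earlier.

```r
# Hypothetical stand-ins for html_df and json_df, reduced to two columns.
html_like <- data.frame(
  Title   = c("Digital Transformation", "The Unix Programming Environment"),
  Authors = c("Thomas M. Siebel", "Brian W. Kernighan,Rob Pike"),
  stringsAsFactors = FALSE
)
json_like <- data.frame(
  Title = c("Digital Transformation", "The Unix Programming Environment"),
  stringsAsFactors = FALSE
)
# A JSON array per book becomes one list element per row.
json_like$Authors <- list("Thomas M. Siebel", c("Brian W. Kernighan ", "Rob Pike"))

# Trim stray whitespace around each name, then join the names with commas.
flatten_authors <- function(authors) {
  vapply(authors, function(a) paste(trimws(a), collapse = ","), character(1))
}

json_like$Authors <- flatten_authors(json_like$Authors)
all.equal(html_like, json_like)
```

The same helper could be applied to `json_df$Authors` before calling `all.equal`, although the real data may still differ in other ways, such as spacing inside the names.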
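The two data frames converted from the XML file and the JSON file are also not exactly the same. The Authors column is a character vector in the XML data frame but a list in the JSON data frame.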
all.equal(xml_df, json_df)
## [1] "Component \"Authors\": Modes: character, list"
## [2] "Component \"Authors\": target is character, current is list"