installed the packages below to be able to properly work with xml,json, html data
install.packages(“xml2”) install.packages(“jsonlite”)
Picked three of your favorite books on one of your favorite subjects. At least one of the books has more than one author. For each book, included the title, authors, language and release year
#load the target data
booksdata<-read.csv("https://raw.githubusercontent.com/johnm1990/DATA607/master/Booklist10.csv", stringsAsFactors = FALSE)
#show target data
datatable(booksdata)
HTML- import of book data in HTML format to parse
library(rvest)
library(conflicted)
#read in RAW HTML data from online Github repo
RawHTML1 <- read_html("https://raw.githubusercontent.com/johnm1990/DATA607/master/booktitles11.html")
#parse file from imported rawHTML
html_parsed1 <- htmlParse(RawHTML1)
#create data frame
html_table<-readHTMLTable(html_parsed1, stringsAsFactors = FALSE)
html_table<-html_table[[1]]
#view dataframe
datatable(html_table)
XML - import of book data in XML format to parse
library(XML)
library(conflicted)
#read in RAW XML data from online Github repo
RawXML2 <- read_xml("https://raw.githubusercontent.com/johnm1990/DATA607/master/booktitles10.xml")
#parse file from imported rawXML
xml_parsed2<-xmlParse(RawXML2)
#create data frame
xml_table2<-xmlToDataFrame(xml_parsed2, stringsAsFactors = FALSE)
#view dataframe
datatable(xml_table2)
JSON - import of book data in JSON format to parse
#load packages
library(jsonlite)
library(rvest)
#read in RAW JSON data from online Github repo
RawJSON1<-read_json("https://raw.githubusercontent.com/johnm1990/DATA607/master/booktitles10.json")
#create data frame
json_table<-do.call("rbind", lapply(RawJSON1[[1]], data.frame, stringsAsFactors = FALSE))
#view dataframe
datatable(json_table)
#The two dataframes are almost identical.
all.equal(html_table,json_table)
## [1] "Names: 2 string mismatches"
## [2] "Component \"Pages\": Modes: character, numeric"
## [3] "Component \"Pages\": target is character, current is numeric"
#The two dataframes are identical if the data type of pages is converted to integer.
all.equal(as.integer(html_table$Pages), json_table$Pages)
## [1] TRUE
#The two dataframes are almost identical.
all.equal(xml_table2,json_table)
## [1] "Names: 2 string mismatches"
## [2] "Component \"Pages\": Modes: character, numeric"
## [3] "Component \"Pages\": target is character, current is numeric"
#The two dataframes are identical if the data type of pages is converted to integer.
all.equal(as.integer(xml_table2$Pages), json_table$Pages)
## [1] TRUE
# appears the all.equal function is detecting some sort of name string mismatch in the writing of code between HTML and XML. This is may be minor detection in the creation of the string
all.equal(html_table,xml_table2)
## [1] "Names: 2 string mismatches"
The Data frames between HTML, JSON and XML are for the most part identical. It is important to point out that at first certain components such as ‘pages’ column was reading in as character, but with a ‘as.integer’ implementation, the result was identical.