Overview: This week's assignment is working with HTML, XML and JSON in R.

Load all the required packages.

library(tidyverse)
library(RCurl)
library(rvest)
library(XML)
library(RJSONIO)

Read data from three manually created files: one HTML, one XML and one JSON.

hfile <- "https://raw.githubusercontent.com/ferrysany/CUNY607A7/master/books.html"

# Functions in the XML package cannot read HTTPS URLs directly, so download the file contents with getURL() first
xfile <- getURL("https://raw.githubusercontent.com/ferrysany/CUNY607A7/master/books.xml")
jfile <- getURL("https://raw.githubusercontent.com/ferrysany/CUNY607A7/master/books.json")

Parse all three files

# 1. Parse HTML. The simplest way to read the HTML file is with the rvest package:
# read_html() returns an xml_document (stored in readHtml), and html_table() reads
# the HTML table directly and assigns the correct data type to each column.
readHtml <- read_html(hfile)
hTables <- html_nodes(readHtml, "table")
hbook <- as.data.frame(html_table(hTables, fill = TRUE))

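# 2. Parse XML and convert to a data frame. xmlToDataFrame() returns text columns,
# so item and pages are converted to integers afterwards.
xbook <- xmlToDataFrame(xmlRoot(xmlParse(xfile)), stringsAsFactors = FALSE)
xbook$item <- parse_integer(xbook$item)
xbook$pages <- parse_integer(xbook$pages)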

# 3. Parse JSON and convert to a data frame
jbook <- data.frame(sapply(fromJSON(jfile), c))

hbook
##   item                                      title
## 1    1                    A Brief History of Time
## 2    2                                  Divergent
## 3    3 Fundamentals of Engineering Thermodynamics
##                                                                        author
## 1                                                             Stephen Hawking
## 2                                                               Veronica Roth
## 3 Michael J. Moran, Howard N. Shapiro, Daisie D. Boettner, Margaret B. Bailey
##          country  publicationDate pages
## 1 United Kingdom    March 1, 1988   256
## 2  United States   April 26, 2011   487
## 3  United States December 7, 2010  1004
xbook
##   item                                      title
## 1    1                    A Brief History of Time
## 2    2                                  Divergent
## 3    3 Fundamentals of Engineering Thermodynamics
##                                                                        author
## 1                                                             Stephen Hawking
## 2                                                               Veronica Roth
## 3 Michael J. Moran, Howard N. Shapiro, Daisie D. Boettner, Margaret B. Bailey
##          country  publicationDate pages
## 1 United Kingdom    March 1, 1988   256
## 2  United States   April 26, 2011   487
## 3  United States December 7, 2010  1004
jbook
##   item                                      title
## 1    1                    A Brief History of Time
## 2    2                                  Divergent
## 3    3 Fundamentals of Engineering Thermodynamics
##                                                                        author
## 1                                                             Stephen Hawking
## 2                                                               Veronica Roth
## 3 Michael J. Moran, Howard N. Shapiro, Daisie D. Boettner, Margaret B. Bailey
##          country  publicationDate pages
## 1 United Kingdom    March 1, 1988   256
## 2  United States   April 26, 2011   487
## 3  United States December 7, 2010  1004

Check whether all three data frames are identical

identical(xbook, jbook)
## [1] FALSE
identical(xbook, hbook)
## [1] TRUE
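
Because the printed data frames look the same, a FALSE from identical() usually points to a difference in column types rather than values. One way to check is to compare the class of each column. The sketch below uses two small made-up data frames (`a` and `b` are hypothetical stand-ins, not the book data) and base R only.

```r
# Hypothetical stand-ins: same printed values, different column types.
a <- data.frame(item = 1:3,
                title = c("x", "y", "z"),
                stringsAsFactors = FALSE)
b <- data.frame(item = factor(c("1", "2", "3")),
                title = factor(c("x", "y", "z")))

# identical() compares types and attributes as well as values.
identical(a, b)

# Listing the class of every column shows where the two differ.
sapply(a, class)
sapply(b, class)
```

Running the same `sapply(..., class)` call on xbook, hbook and jbook would show which columns need converting.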

xbook and jbook are not identical because every column in jbook is a factor, so the column types are wrong. xbook and hbook are identical. I still need to convert the columns in jbook to the correct data types, as I did for the item and pages columns in xbook.
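
A minimal sketch of that conversion, using base R only. `jbook_demo` is a hypothetical stand-in built by hand so the example runs without the downloaded files; it assumes, as described above, that every column arrives as a factor. The same steps should apply to jbook itself, but that has not been run here.

```r
# Hypothetical stand-in for jbook: every column is a factor.
jbook_demo <- data.frame(
  item  = c("1", "2", "3"),
  title = c("A Brief History of Time", "Divergent",
            "Fundamentals of Engineering Thermodynamics"),
  pages = c("256", "487", "1004"),
  stringsAsFactors = TRUE
)

# Convert every factor to character first. Calling as.integer() directly on
# a factor would return the internal level codes, not the printed values.
jbook_demo[] <- lapply(jbook_demo, as.character)

# Then convert the numeric columns to integer.
jbook_demo$item  <- as.integer(jbook_demo$item)
jbook_demo$pages <- as.integer(jbook_demo$pages)

str(jbook_demo)
```

Going through character first is the important step; after it, the column types match those produced for xbook.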