Overview: This week's assignment is to work with HTML, XML, and JSON files in R.
Load all the required packages.
library(tidyverse)
library(RCurl)
library(rvest)
library(XML)
library(RJSONIO)
Read data from three manually created files: one HTML, one XML, and one JSON.
hfile <- "https://raw.githubusercontent.com/ferrysany/CUNY607A7/master/books.html"
# The XML package cannot fetch HTTPS URLs directly, so download the XML and JSON content as text with getURL() first
xfile <- getURL("https://raw.githubusercontent.com/ferrysany/CUNY607A7/master/books.xml")
jfile <- getURL("https://raw.githubusercontent.com/ferrysany/CUNY607A7/master/books.json")
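Because the XML parser is handed text rather than a URL here, it may help to see that step in isolation. The following is a minimal sketch, assuming a small inline XML string (a hypothetical sample, not the actual books.xml) in place of the text that getURL() returns:

```r
library(XML)

# Hypothetical sample standing in for the text returned by getURL();
# the real books.xml may be laid out differently
xml_text <- paste0(
  "<books>",
  "<book><item>1</item><title>A Brief History of Time</title><pages>256</pages></book>",
  "<book><item>2</item><title>Divergent</title><pages>487</pages></book>",
  "</books>"
)

# asText = TRUE tells the parser the argument is XML content, not a file name
doc <- xmlParse(xml_text, asText = TRUE)

# One row per <book>, one column per child element; values arrive as text
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Convert the numeric columns explicitly
books$item <- as.integer(books$item)
books$pages <- as.integer(books$pages)
str(books)
```

The same pattern applies to any source that can be read into a character string, so the download step and the parse step stay independent.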
Parse all three files
#1. Parse HTML. The simplest approach is the "rvest" package: read_html() returns the page as an xml_document ("readHtml"), and html_table() reads the HTML table directly and assigns the correct data type to each column
readHtml <- read_html(hfile)
hTables <- html_nodes(readHtml, "table")
hbook <- as.data.frame(html_table(hTables, fill = TRUE))
#2. Parse XML and convert to a data frame. xmlToDataFrame() returns every column as text, so convert the numeric columns to integer
xbook <- xmlToDataFrame(xmlRoot(xmlParse(xfile)), stringsAsFactors = FALSE)
xbook$item <- parse_integer(xbook$item)
xbook$pages <- parse_integer(xbook$pages)
#3. Parse JSON and convert to a data frame
jbook <- data.frame(sapply(fromJSON(jfile), c))
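One likely reason the JSON result needs extra care: sapply(..., c) over columns of different types produces a single character matrix, so numeric values become strings before data.frame() ever sees them. A minimal base-R sketch, assuming the parsed JSON is a named list of column vectors (the `parsed` list below is a hypothetical stand-in; what fromJSON() actually returns depends on how books.json is laid out):

```r
# Hypothetical stand-in for the parsed JSON: one vector per column
parsed <- list(
  item  = c(1, 2, 3),
  title = c("A Brief History of Time", "Divergent",
            "Fundamentals of Engineering Thermodynamics"),
  pages = c(256, 487, 1004)
)

# sapply() simplifies to a matrix, and a matrix holds one type only,
# so the numbers are coerced to character
as_matrix <- sapply(parsed, c)
typeof(as_matrix)

# Building the data frame from the list keeps each column's own type
typed <- as.data.frame(parsed, stringsAsFactors = FALSE)

# JSON numbers arrive as doubles here; make the whole-number columns integer
typed$item  <- as.integer(typed$item)
typed$pages <- as.integer(typed$pages)
str(typed)
```

If the real file is an array of records rather than a set of columns, the list would need to be reshaped first, so this should be read as an illustration of the type issue rather than a drop-in replacement.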
hbook
##   item                                      title
## 1    1                    A Brief History of Time
## 2    2                                  Divergent
## 3    3 Fundamentals of Engineering Thermodynamics
##                                                                         author
## 1                                                             Stephen Hawking
## 2                                                               Veronica Roth
## 3 Michael J. Moran, Howard N. Shapiro, Daisie D. Boettner, Margaret B. Bailey
##          country  publicationDate pages
## 1 United Kingdom    March 1, 1988   256
## 2  United States   April 26, 2011   487
## 3  United States December 7, 2010  1004
xbook
##   item                                      title
## 1    1                    A Brief History of Time
## 2    2                                  Divergent
## 3    3 Fundamentals of Engineering Thermodynamics
##                                                                         author
## 1                                                             Stephen Hawking
## 2                                                               Veronica Roth
## 3 Michael J. Moran, Howard N. Shapiro, Daisie D. Boettner, Margaret B. Bailey
##          country  publicationDate pages
## 1 United Kingdom    March 1, 1988   256
## 2  United States   April 26, 2011   487
## 3  United States December 7, 2010  1004
jbook
##   item                                      title
## 1    1                    A Brief History of Time
## 2    2                                  Divergent
## 3    3 Fundamentals of Engineering Thermodynamics
##                                                                         author
## 1                                                             Stephen Hawking
## 2                                                               Veronica Roth
## 3 Michael J. Moran, Howard N. Shapiro, Daisie D. Boettner, Margaret B. Bailey
##          country  publicationDate pages
## 1 United Kingdom    March 1, 1988   256
## 2  United States   April 26, 2011   487
## 3  United States December 7, 2010  1004
Check whether all three data frames are identical
identical(xbook, jbook)
## [1] FALSE
identical(xbook, hbook)
## [1] TRUE
xbook and hbook are identical, but xbook and jbook are not: every column in jbook is a factor, so its data types are incorrect even though the printed values look the same. The columns in jbook still need to be converted back to the correct data types, as I did for the item and pages columns in xbook.
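The conversion described above could be sketched as follows. This is an illustration under assumptions: `jbook_like` is a hypothetical data frame built by hand to mimic the all-factor jbook, and only three of the columns are included. A factor has to go through as.character() first, because as.integer() applied directly to a factor returns the internal level codes rather than the printed numbers.

```r
# Hypothetical stand-in for jbook: every column is a factor
jbook_like <- data.frame(
  item  = factor(c("1", "2", "3")),
  title = factor(c("A Brief History of Time", "Divergent",
                   "Fundamentals of Engineering Thermodynamics")),
  pages = factor(c("256", "487", "1004"))
)

# Pitfall: this yields the factor's level codes, not the page counts
codes <- as.integer(jbook_like$pages)

# Convert every factor to character, then the numeric columns to integer
fixed <- jbook_like
fixed[] <- lapply(fixed, as.character)
fixed$item  <- as.integer(fixed$item)
fixed$pages <- as.integer(fixed$pages)

# Reference data frame with the intended types, for comparison
reference <- data.frame(
  item  = 1:3,
  title = c("A Brief History of Time", "Divergent",
            "Fundamentals of Engineering Thermodynamics"),
  pages = c(256L, 487L, 1004L),
  stringsAsFactors = FALSE
)

# Compare column classes first, then the data frames as a whole
sapply(fixed, class)
identical(fixed, reference)
```

Checking sapply(df, class) on each data frame before calling identical() is a quick way to see which columns differ in type when the printed output looks the same.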