Load all necessary packages
library(XML)
library(jsonlite)
library(RCurl)
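Download the HTML, XML and JSON files. Note that getURL() returns the file contents, so each variable below holds the text of a file rather than a URL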
htmlLink <- getURL('https://raw.githubusercontent.com/ezaccountz/Week_7_Assignment/master/books.html')
xmlLink <- getURL('https://raw.githubusercontent.com/ezaccountz/Week_7_Assignment/master/Books.xml')
jsonLink <- getURL('https://raw.githubusercontent.com/ezaccountz/Week_7_Assignment/master/Books.json')
Parse the data using the functions from the loaded packages
htmlFile <- htmlParse(htmlLink)
xmlFile <- xmlParse(xmlLink)
jsonFile <- fromJSON(jsonLink)
Convert the parsed data into data frames
htmlDataFrame <- as.data.frame(readHTMLTable(htmlFile))
root <- xmlRoot(xmlFile)
xmlDataFrame <- xmlToDataFrame(root)
jsonDataFrame <- as.data.frame(jsonFile)
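The parse-then-convert steps can be tried offline with small inline documents. This is only a minimal sketch: the XML and JSON strings, the tag names, and the xmlDemo and jsonDemo variables are hypothetical stand-ins and are not the contents of the downloaded files.

```r
library(XML)
library(jsonlite)

# Hypothetical inline documents standing in for the downloaded files
xmlText <- "<books><book><title>A</title><year>2013</year></book><book><title>B</title><year>2015</year></book></books>"
jsonText <- '[{"title":"A","year":"2013"},{"title":"B","year":"2015"}]'

# asText = TRUE tells the parser that the argument is XML content, not a file name
xmlDoc <- xmlParse(xmlText, asText = TRUE)

# Each child of the root element becomes one row
xmlDemo <- xmlToDataFrame(xmlDoc, stringsAsFactors = FALSE)

# A JSON array of flat objects is simplified to a data frame by default
jsonDemo <- fromJSON(jsonText)

dim(xmlDemo)
dim(jsonDemo)
```

Passing stringsAsFactors explicitly in the sketch avoids depending on the default, which may differ between R and package versions.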
Rename the columns of all 3 data frames so that they share the same column names.
headers <- c("Title", "Authors", "Publisher", "Year of Publication", "ISBN-10", "ISBN-13")
colnames(htmlDataFrame) <- headers
colnames(xmlDataFrame) <- headers
colnames(jsonDataFrame) <- headers
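Headers such as "Year of Publication" and "ISBN-10" contain spaces or hyphens, so they are not syntactic R names. A small sketch, using a hypothetical two-column data frame named demo, of how such columns can still be accessed after renaming:

```r
# Hypothetical data frame used only for illustration
demo <- data.frame(a = c("2013", "2015"),
                   b = c("1449358659", "0692434879"),
                   stringsAsFactors = FALSE)
colnames(demo) <- c("Year of Publication", "ISBN-10")

# Non-syntactic names need [[ ]] with a string, or backticks after $
years <- demo[["Year of Publication"]]
isbns <- demo$`ISBN-10`

years
isbns
```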
Now let’s look at the 3 data frames.
HTML Data Frame:
htmlDataFrame
## Title
## 1 Doing Data Science: Straight Talk from the Frontline
## 2 The Data Science Handbook: Advice and Insights from 25 Amazing Data Scientists
## 3 The Art of Statistics: How to Learn from Data
## Authors Publisher
## 1 Cathy O'Neil, Rachel Schutt O'Reilly Media
## 2 Carl Shan, William Chen, Henry Wang, Max Song Data Science Bookshelf
## 3 David Spiegelhalter Basic Books
## Year of Publication ISBN-10 ISBN-13
## 1 2013 1449358659 978-1449358655
## 2 2015 0692434879 978-0692434871
## 3 2019 1541618513 978-1541618510
XML Data Frame:
xmlDataFrame
## Title
## 1 Doing Data Science: Straight Talk from the Frontline
## 2 The Data Science Handbook: Advice and Insights from 25 Amazing Data Scientists
## 3 The Art of Statistics: How to Learn from Data
## Authors Publisher
## 1 Cathy O'Neil, Rachel Schutt O'Reilly Media
## 2 Carl Shan, William Chen, Henry Wang, Max Song Data Science Bookshelf
## 3 David Spiegelhalter Basic Books
## Year of Publication ISBN-10 ISBN-13
## 1 2013 1449358659 978-1449358655
## 2 2015 0692434879 978-0692434871
## 3 2019 1541618513 978-1541618510
JSON Data Frame:
jsonDataFrame
## Title
## 1 Doing Data Science: Straight Talk from the Frontline
## 2 The Data Science Handbook: Advice and Insights from 25 Amazing Data Scientists
## 3 The Art of Statistics: How to Learn from Data
## Authors Publisher
## 1 Cathy O'Neil, Rachel Schutt O'Reilly Media
## 2 Carl Shan, William Chen, Henry Wang, Max Song Data Science Bookshelf
## 3 David Spiegelhalter Basic Books
## Year of Publication ISBN-10 ISBN-13
## 1 2013 1449358659 978-1449358655
## 2 2015 0692434879 978-0692434871
## 3 2019 1541618513 978-1541618510
The data in all 3 data frames look the same. However, let’s check the classes of the columns.
sapply(htmlDataFrame, class)
## Title Authors Publisher
## "factor" "factor" "factor"
## Year of Publication ISBN-10 ISBN-13
## "factor" "factor" "factor"
sapply(xmlDataFrame, class)
## Title Authors Publisher
## "factor" "factor" "factor"
## Year of Publication ISBN-10 ISBN-13
## "factor" "factor" "factor"
sapply(jsonDataFrame, class)
## Title Authors Publisher
## "character" "character" "character"
## Year of Publication ISBN-10 ISBN-13
## "character" "character" "character"
The JSON data frame has all columns as characters, while the other two have factors. We can convert the data so that all 3 data frames use the same type. Here I convert the columns of the JSON data frame to factors.
jsonDataFrame2 <- jsonDataFrame
jsonDataFrame2[] <- lapply(jsonDataFrame2, factor)
sapply(jsonDataFrame2, class)
## Title Authors Publisher
## "factor" "factor" "factor"
## Year of Publication ISBN-10 ISBN-13
## "factor" "factor" "factor"
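The same column-wise pattern works in either direction. The sketch below uses a hypothetical data frame named books to show the conversion to factors and, as an alternative that was not used in this report, the conversion back to characters.

```r
# Hypothetical data frame with character columns
books <- data.frame(Title = c("A", "B"),
                    Year = c("2013", "2015"),
                    stringsAsFactors = FALSE)

# Assigning into booksAsFactors[] replaces every column but keeps the data frame structure
booksAsFactors <- books
booksAsFactors[] <- lapply(booksAsFactors, factor)

# Reverse direction: turn factor columns back into character vectors
booksAsCharacters <- booksAsFactors
booksAsCharacters[] <- lapply(booksAsCharacters, as.character)

sapply(booksAsFactors, class)
sapply(booksAsCharacters, class)
```

Which direction to pick is a design choice: converting everything to character avoids carrying factor levels around, while converting to factors matches what the HTML and XML data frames already contain here.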
Finally, we can see that the final versions of all 3 data frames are identical.
all.equal(htmlDataFrame, xmlDataFrame)
## [1] TRUE
all.equal(htmlDataFrame, jsonDataFrame2)
## [1] TRUE
all.equal(xmlDataFrame, jsonDataFrame2)
## [1] TRUE
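One detail worth keeping in mind: all.equal() returns TRUE when the objects match, but when they differ it returns a character vector describing the differences rather than FALSE. A minimal sketch with two hypothetical data frames, charFrame and factorFrame, that hold the same values in different column classes:

```r
# Hypothetical data frames: same values, different column classes
charFrame <- data.frame(x = c("1", "2"), stringsAsFactors = FALSE)
factorFrame <- data.frame(x = factor(c("1", "2")))

# The result is a description of the mismatch, not a logical value
comparison <- all.equal(charFrame, factorFrame)
comparison

# Wrap the call in isTRUE() when a single TRUE/FALSE is needed
sameFrames <- isTRUE(comparison)
sameFrames

# identical() is a stricter check that always returns TRUE or FALSE
strictlySame <- identical(charFrame, factorFrame)
strictlySame
```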