Data_607_Week_7_Assignment

Load all necessary packages

library(XML)
library(jsonlite)
library(RCurl)

Download the contents of the HTML, XML and JSON files from their URLs

htmlLink <- getURL('https://raw.githubusercontent.com/ezaccountz/Week_7_Assignment/master/books.html')
xmlLink <- getURL('https://raw.githubusercontent.com/ezaccountz/Week_7_Assignment/master/Books.xml')
jsonLink <- getURL('https://raw.githubusercontent.com/ezaccountz/Week_7_Assignment/master/Books.json')

Parse the data using the functions from the loaded packages

htmlFile <- htmlParse(htmlLink)
xmlFile <- xmlParse(xmlLink)
jsonFile <- fromJSON(jsonLink)
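To see what the parsing step produces without touching the network, here is a minimal sketch that parses a small inline JSON string with fromJSON. The string sampleJson and its fields are made up for illustration and are not the contents of Books.json. It assumes the input is an array of flat records, which jsonlite simplifies into a data frame.

```r
library(jsonlite)

# Hypothetical sample data, not the actual contents of Books.json
sampleJson <- '[
  {"Title": "Book A", "Year": "2013"},
  {"Title": "Book B", "Year": "2015"}
]'

# fromJSON simplifies an array of flat records into a data frame
sampleParsed <- fromJSON(sampleJson)

class(sampleParsed)
# Quoted JSON values arrive as character columns
sapply(sampleParsed, class)
```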

Convert the parsed data into data frames

htmlDataFrame <- as.data.frame(readHTMLTable(htmlFile))
root <- xmlRoot(xmlFile)
xmlDataFrame <- xmlToDataFrame(root)
jsonDataFrame <- as.data.frame(jsonFile)

Rename the columns of all 3 data frames so that they have the same column names.

headers <- c("Title", "Authors", "Publisher", "Year of Publication", "ISBN-10", "ISBN-13")
colnames(htmlDataFrame) <- headers
colnames(xmlDataFrame) <- headers
colnames(jsonDataFrame) <- headers
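Several of these headers contain spaces or hyphens, so they are not syntactic R names. A small sketch, using a made-up data frame named demo, of how such columns can still be accessed afterwards:

```r
# Hypothetical one-row data frame used only for illustration
demo <- data.frame(a = "1449358659", b = "2013", stringsAsFactors = FALSE)
colnames(demo) <- c("ISBN-10", "Year of Publication")

# Backticks or [[ ]] are needed because the names are not syntactic
demo$`ISBN-10`
demo[["Year of Publication"]]
```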

Now let’s look at the 3 data frames.

HTML Data Frame:

htmlDataFrame

##                                                                            Title
## 1                           Doing Data Science: Straight Talk from the Frontline
## 2 The Data Science Handbook: Advice and Insights from 25 Amazing Data Scientists
## 3                                  The Art of Statistics: How to Learn from Data
##                                         Authors              Publisher
## 1                   Cathy O'Neil, Rachel Schutt         O'Reilly Media
## 2 Carl Shan, William Chen, Henry Wang, Max Song Data Science Bookshelf
## 3                           David Spiegelhalter            Basic Books
##   Year of Publication    ISBN-10        ISBN-13
## 1                2013 1449358659 978-1449358655
## 2                2015 0692434879 978-0692434871
## 3                2019 1541618513 978-1541618510

XML Data Frame:

xmlDataFrame

##                                                                            Title
## 1                           Doing Data Science: Straight Talk from the Frontline
## 2 The Data Science Handbook: Advice and Insights from 25 Amazing Data Scientists
## 3                                  The Art of Statistics: How to Learn from Data
##                                         Authors              Publisher
## 1                   Cathy O'Neil, Rachel Schutt         O'Reilly Media
## 2 Carl Shan, William Chen, Henry Wang, Max Song Data Science Bookshelf
## 3                           David Spiegelhalter            Basic Books
##   Year of Publication    ISBN-10        ISBN-13
## 1                2013 1449358659 978-1449358655
## 2                2015 0692434879 978-0692434871
## 3                2019 1541618513 978-1541618510

JSON Data Frame:

jsonDataFrame

##                                                                            Title
## 1                           Doing Data Science: Straight Talk from the Frontline
## 2 The Data Science Handbook: Advice and Insights from 25 Amazing Data Scientists
## 3                                  The Art of Statistics: How to Learn from Data
##                                         Authors              Publisher
## 1                   Cathy O'Neil, Rachel Schutt         O'Reilly Media
## 2 Carl Shan, William Chen, Henry Wang, Max Song Data Science Bookshelf
## 3                           David Spiegelhalter            Basic Books
##   Year of Publication    ISBN-10        ISBN-13
## 1                2013 1449358659 978-1449358655
## 2                2015 0692434879 978-0692434871
## 3                2019 1541618513 978-1541618510

The data in all 3 data frames look the same. However, let’s check the classes of the columns:

sapply(htmlDataFrame, class)

##               Title             Authors           Publisher 
##            "factor"            "factor"            "factor" 
## Year of Publication             ISBN-10             ISBN-13 
##            "factor"            "factor"            "factor"

sapply(xmlDataFrame, class)

##               Title             Authors           Publisher 
##            "factor"            "factor"            "factor" 
## Year of Publication             ISBN-10             ISBN-13 
##            "factor"            "factor"            "factor"

sapply(jsonDataFrame, class)

##               Title             Authors           Publisher 
##         "character"         "character"         "character" 
## Year of Publication             ISBN-10             ISBN-13 
##         "character"         "character"         "character"
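A likely reason for the difference: before R 4.0.0, data.frame() and as.data.frame() converted strings to factors by default (stringsAsFactors = TRUE), while jsonlite returns plain character vectors. The sketch below sets the argument explicitly on a made-up data frame, so its result does not depend on the R version in use.

```r
# Explicit settings, so the result does not depend on the R version's default
asFactor <- data.frame(Publisher = c("O'Reilly Media", "Basic Books"),
                       stringsAsFactors = TRUE)
asChar <- data.frame(Publisher = c("O'Reilly Media", "Basic Books"),
                     stringsAsFactors = FALSE)

class(asFactor$Publisher)  # "factor"
class(asChar$Publisher)    # "character"
```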

The JSON data frame has all columns as characters, while the other two use factors. We can manipulate the data so that all 3 data frames have the same column types. Here I convert the columns of the JSON data frame to factors

jsonDataFrame2 <- jsonDataFrame
jsonDataFrame2[] <- lapply(jsonDataFrame2, factor)
sapply(jsonDataFrame2, class)

##               Title             Authors           Publisher 
##            "factor"            "factor"            "factor" 
## Year of Publication             ISBN-10             ISBN-13 
##            "factor"            "factor"            "factor"
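The conversion could also go the other way, turning factor columns into characters. This is an alternative to the approach used here, sketched on a made-up data frame named demoFactors rather than on the assignment data:

```r
# Hypothetical factor-based data frame used only for illustration
demoFactors <- data.frame(Title = c("Book A", "Book B"),
                          Year = c("2013", "2015"),
                          stringsAsFactors = TRUE)

# Convert every column to character; [] keeps the data frame structure
demoChars <- demoFactors
demoChars[] <- lapply(demoChars, as.character)

sapply(demoChars, class)

# Go through character first when a numeric column is wanted,
# since as.numeric() on a factor returns the internal level codes
as.numeric(as.character(demoFactors$Year))
```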

Finally, we can confirm that the final versions of all 3 data frames are identical

all.equal(htmlDataFrame, xmlDataFrame)

## [1] TRUE

all.equal(htmlDataFrame, jsonDataFrame2)

## [1] TRUE

all.equal(xmlDataFrame, jsonDataFrame2)

## [1] TRUE
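One caveat when using all.equal in conditions: it returns TRUE when the objects match, but a character vector describing the differences when they do not. A minimal sketch with two made-up data frames, a and b, that differ only in column class:

```r
# Hypothetical data frames: same values, different column classes
a <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
b <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)

# Wrap all.equal() in isTRUE() when a single logical is needed
isTRUE(all.equal(a, a))
isTRUE(all.equal(a, b))

# identical() always returns a single TRUE or FALSE
identical(a, b)
```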

Euclid Zhang

10/12/2019