In this assignment, I created a list of books stored in an HTML table, an XML file, and a JSON file, then used R to read and parse each file into its own data frame. We then review the three resulting data frames and note any differences.
# Attach each package, installing it first if it is missing.
# The order matters: RJSONIO is attached after jsonlite, so its fromJSON() masks jsonlite's.
pkgs <- c('rvest', 'XML', 'jsonlite', 'RCurl', 'rlist', 'magrittr', 'tidyverse', 'RJSONIO', 'DT')
for (p in pkgs) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
    library(p, character.only = TRUE)  # a failed require() leaves the package unattached
  }
}
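# Download the page, parse it, and select every <table> node.
html_link <- getURL("https://raw.githubusercontent.com/Vinayak234/SPS_DATA_607/master/SPS_DATA_607/week_7/books.html") %>%
  read_html() %>%
  html_nodes(xpath = "//table")
# Convert the first table on the page to a data frame.
df_html <- html_table(html_link)[[1]]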
datatable(df_html)
library(XML)  # xmlParse(), xmlRoot() and xmlToDataFrame() below come from XML, not xml2
xml_link <- getURL("https://raw.githubusercontent.com/Vinayak234/SPS_DATA_607/master/SPS_DATA_607/week_7/books.xml")
xml_parsed <- xmlParse(xml_link)
xml_root <- xmlRoot(xml_parsed)
df_xml <- xmlToDataFrame(xml_root, stringsAsFactors = FALSE)
datatable(df_xml)
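# Both jsonlite and RJSONIO export fromJSON(); RJSONIO is attached last, so its version is the one used here.
json_link <- RJSONIO::fromJSON(getURL("https://raw.githubusercontent.com/Vinayak234/SPS_DATA_607/master/SPS_DATA_607/week_7/books.json"))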
# Flattening the parsed list gives one wide row: 3 books x 6 fields = 18 columns.
json_flat <- as.data.frame(json_link, stringsAsFactors = FALSE)
# Slice out the six columns belonging to each book.
temp_1 <- json_flat[1, 1:6]
temp_2 <- json_flat[1, 7:12]
temp_3 <- json_flat[1, 13:18]
# Give every slice the same column names so the rows can be stacked.
names(temp_2) <- names(temp_1)
names(temp_3) <- names(temp_1)
df_json <- rbind(temp_1, temp_2, temp_3)
datatable(df_json)
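As an alternative to slicing one wide row into three pieces, jsonlite's fromJSON can simplify an array of records straight into a data frame. The sketch below is only an illustration: it assumes the file holds a top-level rbooklist array of objects (inferred from the column names in the output), and the two sample records are invented placeholders, not the contents of books.json.

```r
library(jsonlite)

# Hypothetical stand-in for books.json: structure assumed, values invented.
json_txt <- '{
  "rbooklist": [
    {"title": "Book A", "year": 2011, "authors": "Author One", "pages": 472},
    {"title": "Book B", "year": 2015, "authors": "Author Two", "pages": 480}
  ]
}'

# Call jsonlite's parser explicitly, since RJSONIO exports a fromJSON() too.
parsed <- jsonlite::fromJSON(json_txt)

# An array of objects sharing the same keys is simplified to a data frame,
# one row per object.
df_sketch <- parsed$rbooklist
str(df_sketch)
```

Extracting the rbooklist element before building the data frame should also avoid the rbooklist. prefix on the column names.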
They all look equivalent to the eye, but are they? We can use the base R function ‘all.equal’ to test whether they really match.
all.equal(df_html, df_xml)
## [1] "Component \"year\": Modes: numeric, character"
## [2] "Component \"year\": target is numeric, current is character"
## [3] "Component \"pages\": Modes: numeric, character"
## [4] "Component \"pages\": target is numeric, current is character"
all.equal(df_html, df_json)
## [1] "Names: 6 string mismatches" "Component 3: 1 string mismatch"
all.equal(df_xml, df_json)
## [1] "Names: 6 string mismatches"
## [2] "Component 2: Modes: character, numeric"
## [3] "Component 2: target is character, current is numeric"
## [4] "Component 3: 1 string mismatch"
## [5] "Component 5: Modes: character, numeric"
## [6] "Component 5: target is character, current is numeric"
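Most of these messages come from columns that hold the same values under different types. Here is a minimal sketch of how coercing the type reconciles such a comparison, using two small made-up data frames rather than the book data:

```r
# Two toy data frames with the same values (invented for illustration);
# year is numeric in one and character in the other.
a <- data.frame(title = c("Book A", "Book B"), year = c(2011, 2015),
                stringsAsFactors = FALSE)
b <- data.frame(title = c("Book A", "Book B"), year = c("2011", "2015"),
                stringsAsFactors = FALSE)

# all.equal() returns TRUE or a character vector describing the differences,
# so wrap it in isTRUE() whenever a single logical is needed.
before <- isTRUE(all.equal(a, b))

# Coerce the character column to numeric, then compare again.
b_fixed <- b
b_fixed$year <- as.numeric(b_fixed$year)
after <- isTRUE(all.equal(a, b_fixed))

c(before = before, after = after)
```

The same coercion could be applied to the year and pages columns of a character-only import before comparing it with the others.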
summary(df_html)
##     title                year         author           publisher
##  Length:3           Min.   :2011   Length:3           Length:3
##  Class :character   1st Qu.:2013   Class :character   Class :character
##  Mode  :character   Median :2015   Mode  :character   Mode  :character
##                     Mean   :2014
##                     3rd Qu.:2015
##                     Max.   :2015
##      pages         ISBN
##  Min.   :472   Length:3
##  1st Qu.:474   Class :character
##  Median :476   Mode  :character
##  Mean   :476
##  3rd Qu.:478
##  Max.   :480
summary(df_xml)
##     title               year              author
##  Length:3           Length:3           Length:3
##  Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character
##   publisher            pages               ISBN
##  Length:3           Length:3           Length:3
##  Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character
summary(df_json)
##  rbooklist.title    rbooklist.year rbooklist.authors  rbooklist.publisher
##  Length:3           Min.   :2011   Length:3           Length:3
##  Class :character   1st Qu.:2013   Class :character   Class :character
##  Mode  :character   Median :2015   Mode  :character   Mode  :character
##                     Mean   :2014
##                     3rd Qu.:2015
##                     Max.   :2015
##  rbooklist.pages rbooklist.ISBN
##  Min.   :472     Length:3
##  1st Qu.:474     Class :character
##  Median :476     Mode  :character
##  Mean   :476
##  3rd Qu.:478
##  Max.   :480
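The values in the three data frames are essentially the same, but the data types are not. Most notably, the XML import stored every variable as character, whereas the HTML and JSON imports parsed year and pages as numeric. The JSON data frame also carries different column names (each prefixed with rbooklist., and authors rather than author), and all.equal reports one string mismatch in its third column.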
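The per-column classes can also be lined up side by side instead of being read off separate summaries. This is a minimal sketch on two invented one-row data frames; df_a, df_b and class_table are illustrative names, not objects from the analysis above.

```r
# Toy stand-ins for two imports of the same table (values invented).
df_a <- data.frame(title = "Book A", year = 2011, pages = 472,
                   stringsAsFactors = FALSE)
df_b <- data.frame(title = "Book A", year = "2011", pages = "472",
                   stringsAsFactors = FALSE)

# First class of each column, returned as a plain character vector.
col_classes <- function(df) unname(vapply(df, function(x) class(x)[1], character(1)))

# One row per column, one class column per data frame.
class_table <- data.frame(
  column = names(df_a),
  df_a   = col_classes(df_a),
  df_b   = col_classes(df_b),
  stringsAsFactors = FALSE
)
class_table
```

Rows where the two class columns disagree point to the variables that need coercion before the data frames can compare as equal.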
Jeff Littlejohn