Abstract

In this assignment, I created a list of books stored in three formats (an HTML table, an XML file, and a JSON file), then used R to read and parse each file into its own data frame. The three resulting data frames are then compared and any differences noted.

Environment Prep

# Load each package, installing it first if it is missing.
# Order matters: RJSONIO is attached after jsonlite, so its fromJSON() masks jsonlite's.
pkgs <- c('rvest', 'XML', 'jsonlite', 'RCurl', 'rlist', 'magrittr', 'tidyverse', 'RJSONIO', 'DT')
for (p in pkgs) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
    library(p, character.only = TRUE)
  }
}

Importing

HTML Table

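# Download the page, parse it, and select every <table> node
html_tables <- getURL("https://raw.githubusercontent.com/Vinayak234/SPS_DATA_607/master/SPS_DATA_607/week_7/books.html") %>%
  read_html() %>%
  html_nodes(xpath = "//table")

# The first (and only) table holds the book list
df_html <- html_table(html_tables)[[1]]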
datatable(df_html)

XML

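# xmlParse(), xmlRoot(), and xmlToDataFrame() all come from the XML package loaded above
xml_link <- getURL("https://raw.githubusercontent.com/Vinayak234/SPS_DATA_607/master/SPS_DATA_607/week_7/books.xml")
xml_parsed <- xmlParse(xml_link)

# Each child of the root node becomes one row of the data frame
xml_root <- xmlRoot(xml_parsed)
df_xml <- xmlToDataFrame(xml_root, stringsAsFactors = FALSE)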
datatable(df_xml)
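
xmlToDataFrame() reads every field as text, so numeric fields such as year and pages arrive as character. A minimal sketch of one way to restore the types with base R's type.convert(), shown on a small stand-in data frame (the values are made up for illustration and are not taken from books.xml):

```r
# Stand-in for df_xml: every column is character.
# All values here are hypothetical placeholders.
df_chr <- data.frame(
  title = c("Book A", "Book B"),
  year  = c("2011", "2015"),
  pages = c("472", "480"),
  stringsAsFactors = FALSE
)

# type.convert() picks a type per column; as.is = TRUE keeps text
# columns as character instead of converting them to factors.
df_typed <- type.convert(df_chr, as.is = TRUE)

sapply(df_typed, class)
```

A column that only looks numeric, such as an ISBN written without hyphens, would also be converted, so it may be safer to convert selected columns by name.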

JSON

# With the load order above, fromJSON() should resolve to RJSONIO's version, which returns a nested list
json_link <- fromJSON(getURL("https://raw.githubusercontent.com/Vinayak234/SPS_DATA_607/master/SPS_DATA_607/week_7/books.json"))
# Flatten once into a single row of 18 columns (3 books x 6 fields), then slice out each book
json_flat <- as.data.frame(json_link, stringsAsFactors = FALSE)
temp_1 <- json_flat[1, 1:6]
temp_2 <- json_flat[1, 7:12]
temp_3 <- json_flat[1, 13:18]

names(temp_2) <- names(temp_1)
names(temp_3) <- names(temp_1)
df_json <- rbind(temp_1,temp_2,temp_3)
datatable(df_json)
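
The column slicing above depends on there being exactly three books with six fields each. A more general sketch, assuming the parsed JSON is a list whose single element holds one named list per book (the list below is a hypothetical stand-in, not the real file):

```r
# Hypothetical stand-in for the parsed JSON: one named list per book.
parsed <- list(
  rbooklist = list(
    list(title = "Book A", year = 2011, pages = 472),
    list(title = "Book B", year = 2015, pages = 480)
  )
)

# Turn each book into a one-row data frame, then stack the rows.
rows <- lapply(parsed$rbooklist, function(book) {
  as.data.frame(book, stringsAsFactors = FALSE)
})
df_books <- do.call(rbind, rows)

str(df_books)
```

This works for any number of books and keeps the original field names without a prefix. A field holding several values, such as multiple authors, would need to be collapsed into one string first.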

Testing Similarity

They all look equivalent to the eye, but are they? We can use the base R function all.equal() to test each pair.
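
all.equal() returns TRUE when two objects match and otherwise a character vector describing the differences, so it is usually wrapped in isTRUE() when a plain yes/no answer is needed. A small sketch with made-up data frames:

```r
# Two hypothetical data frames that differ only in column type.
a <- data.frame(year = c(2011, 2015), stringsAsFactors = FALSE)
b <- data.frame(year = c("2011", "2015"), stringsAsFactors = FALSE)

res <- all.equal(a, b)

is.character(res)        # TRUE: the differences come back as text
isTRUE(res)              # FALSE: the objects are not equal
isTRUE(all.equal(a, a))  # TRUE: an object always equals itself
```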

HTML to XML

all.equal(df_html, df_xml)
## [1] "Component \"year\": Modes: numeric, character"               
## [2] "Component \"year\": target is numeric, current is character" 
## [3] "Component \"pages\": Modes: numeric, character"              
## [4] "Component \"pages\": target is numeric, current is character"

HTML to JSON

all.equal(df_html, df_json)
## [1] "Names: 6 string mismatches"     "Component 3: 1 string mismatch"

XML to JSON

all.equal(df_xml, df_json)
## [1] "Names: 6 string mismatches"                          
## [2] "Component 2: Modes: character, numeric"              
## [3] "Component 2: target is character, current is numeric"
## [4] "Component 3: 1 string mismatch"                      
## [5] "Component 5: Modes: character, numeric"              
## [6] "Component 5: target is character, current is numeric"

Summary

summary(df_html)
##     title                year         author           publisher        
##  Length:3           Min.   :2011   Length:3           Length:3          
##  Class :character   1st Qu.:2013   Class :character   Class :character  
##  Mode  :character   Median :2015   Mode  :character   Mode  :character  
##                     Mean   :2014                                        
##                     3rd Qu.:2015                                        
##                     Max.   :2015                                        
##      pages         ISBN          
##  Min.   :472   Length:3          
##  1st Qu.:474   Class :character  
##  Median :476   Mode  :character  
##  Mean   :476                     
##  3rd Qu.:478                     
##  Max.   :480
summary(df_xml)
##     title               year              author         
##  Length:3           Length:3           Length:3          
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##   publisher            pages               ISBN          
##  Length:3           Length:3           Length:3          
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character
summary(df_json)
##  rbooklist.title    rbooklist.year rbooklist.authors  rbooklist.publisher
##  Length:3           Min.   :2011   Length:3           Length:3           
##  Class :character   1st Qu.:2013   Class :character   Class :character   
##  Mode  :character   Median :2015   Mode  :character   Mode  :character   
##                     Mean   :2014                                         
##                     3rd Qu.:2015                                         
##                     Max.   :2015                                         
##  rbooklist.pages rbooklist.ISBN    
##  Min.   :472     Length:3          
##  1st Qu.:474     Class :character  
##  Median :476     Mode  :character  
##  Mean   :476                       
##  3rd Qu.:478                       
##  Max.   :480

Conclusion

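While the values in each data frame are essentially the same, the data types differ. Most notably, the XML import stored every variable as character, whereas the HTML and JSON imports kept year and pages numeric. The JSON import also produced different column names (prefixed with rbooklist., and authors rather than author) and one string mismatch in the third column.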
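
Since these differences come down to column names and column types, they can be reconciled after import. A sketch with two small stand-in data frames (hypothetical values; the column names mimic the XML and JSON results):

```r
# Hypothetical stand-ins: df_a has text columns, df_b has prefixed names.
df_a <- data.frame(title = "Book A", year = "2011",
                   stringsAsFactors = FALSE)
df_b <- data.frame(rbooklist.title = "Book A", rbooklist.year = 2011,
                   stringsAsFactors = FALSE)

# Strip the prefix so the column names line up.
names(df_b) <- sub("^rbooklist\\.", "", names(df_b))

# Convert the text year to numeric so the types line up.
df_a$year <- as.numeric(df_a$year)

all.equal(df_a, df_b)
```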

Reference

Jeff Littlejohn