IS607 - Week 9 Assignment

Note: The following packages must be installed: rvest, XML, jsonlite, curl.
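
One way to confirm the packages are available before running the rest of the script is sketched below. The names `required`, `is_installed` and `missing_pkgs` are illustrative and not part of the original assignment.

```r
# Packages this assignment relies on
required <- c("rvest", "XML", "jsonlite", "curl")

# requireNamespace() returns FALSE (rather than stopping) when a package is absent
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Install first: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  message("All required packages are installed.")
}
```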

html

library(rvest)
## Warning: package 'rvest' was built under R version 3.1.3
# read the HTML file from its URL (current rvest uses read_html() in place of html())
books_html <- read_html("http://www.2geeks.it/msda/books.html")

#scrape html table data, first row is header
html.data <- books_html %>%
  html_nodes("table") %>%
  html_table(header = TRUE)
html.data <- as.data.frame(html.data)
print(html.data)
##                                                                    Title
## 1                                 Urban Utopias in the Twentieth Century
## 2                                        The Citizen's Guide to Planning
## 3 The Death and Life of Great American Cities (50th Anniversary Edition)
##                    Author1          Author2            Author3
## 1          Fishman, Robert                                    
## 2 Duerksen, Christopher J. Dale, C. Gregory Elliott, Donald L.
## 3             Jacobs, Jane                                    
##          Subject        ISBN.13          Publisher Year.Published
## 1 Urban Planning 978-0262560238      The MIT Press           1982
## 2 Urban Planning 978-1932364651 APA Planners Press           2009
## 3 Urban Planning 978-0679644330     Modern Library           2011

xml

library(XML)
## Warning: package 'XML' was built under R version 3.1.3
## 
## Attaching package: 'XML'
## 
## The following object is masked from 'package:rvest':
## 
##     xml
#read url of xml file
uxml <- "http://www.2geeks.it/msda/books.xml"

#scrape xml data
xml.data <- xmlToDataFrame(uxml)
xml.data <- xml.data[c(1,2,7,8,3,4,5,6)] #fix column order
print(xml.data)
##                                                                    title
## 1                                 Urban Utopias in the Twentieth Century
## 2                                        The Citizen's Guide to Planning
## 3 The Death and Life of Great American Cities (50th Anniversary Edition)
##                    author1          author2            author3
## 1          Fishman, Robert             <NA>               <NA>
## 2 Duerksen, Christopher J. Dale, C. Gregory Elliott, Donald L.
## 3             Jacobs, Jane             <NA>               <NA>
##          subject        isbn-13          publisher year_published
## 1 Urban Planning 978-0262560238      The MIT Press           1982
## 2 Urban Planning 978-1932364651 APA Planners Press           2009
## 3 Urban Planning 978-0679644330     Modern Library           2011
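
The positional index `c(1,2,7,8,3,4,5,6)` works only while the parsed columns arrive in exactly that order. A sketch of reordering by name instead is shown here on a small stand-in data frame. The objects `raw`, `wanted` and `ordered` are illustrative, and the starting column order is the one implied by the index vector, not something checked against the live file.

```r
# Stand-in for the parsed result: author2/author3 arrive last,
# as implied by the positional index c(1,2,7,8,3,4,5,6)
raw <- data.frame(
  title = c("Urban Utopias in the Twentieth Century",
            "The Citizen's Guide to Planning"),
  author1 = c("Fishman, Robert", "Duerksen, Christopher J."),
  subject = "Urban Planning",
  `isbn-13` = c("978-0262560238", "978-1932364651"),
  publisher = c("The MIT Press", "APA Planners Press"),
  year_published = c("1982", "2009"),
  author2 = c(NA, "Dale, C. Gregory"),
  author3 = c(NA, "Elliott, Donald L."),
  check.names = FALSE,       # keep "isbn-13" as written
  stringsAsFactors = FALSE
)

wanted <- c("title", "author1", "author2", "author3",
            "subject", "isbn-13", "publisher", "year_published")

# Selecting by name stops with an error if an expected column is missing
ordered <- raw[wanted]
print(names(ordered))
```

Selecting by name gives the same result as the positional index here, but it does not silently pick the wrong columns if the source file changes.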

json

library(jsonlite)
## Warning: package 'jsonlite' was built under R version 3.1.3
## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:utils':
## 
##     View
library(curl)
## Warning: package 'curl' was built under R version 3.1.3
#read url of json file
ujson <- "http://www.2geeks.it/msda/books.json"

#scrape json data
json.data <- jsonlite::fromJSON(ujson)
json.data <- json.data$books[c(1,2,7,8,3,4,5,6)] #fix column order
print(json.data)
##                                                                    title
## 1                                 Urban Utopias in the Twentieth Century
## 2                                        The Citizen's Guide to Planning
## 3 The Death and Life of Great American Cities (50th Anniversary Edition)
##                    author1          author2            author3
## 1          Fishman, Robert             <NA>               <NA>
## 2 Duerksen, Christopher J. Dale, C. Gregory Elliott, Donald L.
## 3             Jacobs, Jane             <NA>               <NA>
##          subject        isbn-13          publisher year_published
## 1 Urban Planning 978-0262560238      The MIT Press           1982
## 2 Urban Planning 978-1932364651 APA Planners Press           2009
## 3 Urban Planning 978-0679644330     Modern Library           2011
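
To see why `$books` is needed, here is a small sketch with an inline JSON string in place of the URL. The string is made up for illustration and only assumes that the real file wraps its records in a top-level `books` array, which the `json.data$books` call above suggests.

```r
library(jsonlite)

# Hypothetical miniature of the file: one top-level key holding an array of records
txt <- '{"books": [
  {"title": "A", "author1": "X"},
  {"title": "B", "author1": "Y", "author2": "Z"}
]}'

parsed <- fromJSON(txt)

class(parsed)        # a list, one element per top-level key
class(parsed$books)  # the array of records is simplified to a data frame

# Keys absent from a record come back as NA
print(parsed$books)
```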

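The three files yield essentially the same data once read from the web, but the columns from the XML and JSON files do not come back in the intended order and have to be rearranged. The HTML table also differs slightly in form: its column names are capitalised and its missing authors appear as empty strings rather than NA. In addition, xmlToDataFrame() in the XML package returns a data frame directly, while rvest and jsonlite needed an extra step here (as.data.frame() on the html_table() result, and extracting the books element from the fromJSON() result), unless I missed a more direct route.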
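
A quick way to check that two of the results really match is to normalise the differences visible in the printed output (capitalised column names and empty strings in the HTML version, `NA` in the others) and then compare. This is a sketch on two-column stand-ins, not the full tables; `html_like` and `xml_like` are illustrative names.

```r
html_like <- data.frame(Title = c("A", "B"), Author2 = c("", "Z"),
                        stringsAsFactors = FALSE)
xml_like  <- data.frame(title = c("A", "B"), author2 = c(NA, "Z"),
                        stringsAsFactors = FALSE)

# Align the column names, then treat empty strings as missing values
names(html_like) <- tolower(names(html_like))
html_like[html_like == ""] <- NA

print(isTRUE(all.equal(html_like, xml_like)))
```

For the full tables the names would need more than `tolower()`, since `ISBN.13` and `isbn-13` differ in punctuation as well as case.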