IS607 - Week 9 Assignment
Note: The following packages must be installed: rvest, XML, jsonlite, curl.
html
library(rvest)
## Warning: package 'rvest' was built under R version 3.1.3
#read url of html file (html() was deprecated in favor of read_html() in rvest 0.3.0)
books_html <- read_html("http://www.2geeks.it/msda/books.html")
#scrape html table data, first row is header
html.data <- books_html %>%
  html_nodes("table") %>%
  html_table(header = TRUE)
html.data <- as.data.frame(html.data)
print(html.data)
##                                                                     Title
## 1                                 Urban Utopias in the Twentieth Century
## 2                                        The Citizen's Guide to Planning
## 3 The Death and Life of Great American Cities (50th Anniversary Edition)
##                    Author1          Author2            Author3
## 1          Fishman, Robert
## 2 Duerksen, Christopher J. Dale, C. Gregory Elliott, Donald L.
## 3             Jacobs, Jane
##          Subject        ISBN.13          Publisher Year.Published
## 1 Urban Planning 978-0262560238      The MIT Press           1982
## 2 Urban Planning 978-1932364651 APA Planners Press           2009
## 3 Urban Planning 978-0679644330     Modern Library           2011
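When html_table() is given a set of nodes it returns a list with one entry per table, which is why a conversion step follows the scrape. Here is a minimal sketch of that behavior, assuming a recent rvest in which read_html() accepts a literal HTML string. The inline table is a made-up stand-in for books.html, and the names page, tables and books are only for illustration:

```r
library(rvest)

# Hypothetical inline page standing in for books.html
page <- read_html("<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Urban Utopias in the Twentieth Century</td><td>1982</td></tr>
  <tr><td>The Death and Life of Great American Cities</td><td>2011</td></tr>
</table>")

# html_table() on a node set returns a list of tables, not a single table
tables <- html_table(html_nodes(page, "table"), header = TRUE)

# Take the first (and only) table and coerce it to a plain data frame
books <- as.data.frame(tables[[1]])
str(books)
```

Selecting the element with [[1]] makes it explicit which table is wanted if a page ever contains more than one.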
xml
library(XML)
## Warning: package 'XML' was built under R version 3.1.3
##
## Attaching package: 'XML'
##
## The following object is masked from 'package:rvest':
##
## xml
#read url of xml file
uxml <- "http://www.2geeks.it/msda/books.xml"
#scrape xml data
xml.data <- xmlToDataFrame(uxml)
xml.data <- xml.data[c(1,2,7,8,3,4,5,6)] #fix column order
print(xml.data)
##                                                                     title
## 1                                 Urban Utopias in the Twentieth Century
## 2                                        The Citizen's Guide to Planning
## 3 The Death and Life of Great American Cities (50th Anniversary Edition)
##                    author1          author2            author3
## 1          Fishman, Robert             <NA>               <NA>
## 2 Duerksen, Christopher J. Dale, C. Gregory Elliott, Donald L.
## 3             Jacobs, Jane             <NA>               <NA>
##          subject        isbn-13          publisher year_published
## 1 Urban Planning 978-0262560238      The MIT Press           1982
## 2 Urban Planning 978-1932364651 APA Planners Press           2009
## 3 Urban Planning 978-0679644330     Modern Library           2011
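The index vector c(1,2,7,8,3,4,5,6) only works as long as the fields arrive in the same order. Here is a sketch of the same reordering done by column name, on a one-row made-up data frame. The incoming column order in raw is inferred from that index vector, and raw, wanted, by_name and by_index are hypothetical names:

```r
# One-row stand-in with the columns in the order the file appears to deliver them
raw <- data.frame(
  title = "The Citizen's Guide to Planning",
  author1 = "Duerksen, Christopher J.",
  subject = "Urban Planning",
  `isbn-13` = "978-1932364651",
  publisher = "APA Planners Press",
  year_published = "2009",
  author2 = "Dale, C. Gregory",
  author3 = "Elliott, Donald L.",
  check.names = FALSE,       # keep the hyphen in "isbn-13"
  stringsAsFactors = FALSE
)

# Desired column order, spelled out by name
wanted <- c("title", "author1", "author2", "author3",
            "subject", "isbn-13", "publisher", "year_published")

by_name  <- raw[wanted]                      # reorder by name
by_index <- raw[c(1, 2, 7, 8, 3, 4, 5, 6)]   # reorder by position, as above

identical(by_name, by_index)
```

Selecting by name fails with an error if a column is missing, whereas a positional index would silently pick up whatever column sits at that position.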
json
library(jsonlite)
## Warning: package 'jsonlite' was built under R version 3.1.3
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:utils':
##
## View
library(curl)
## Warning: package 'curl' was built under R version 3.1.3
#read url of json file
ujson <- "http://www.2geeks.it/msda/books.json"
#scrape json data
json.data <- jsonlite::fromJSON(ujson)
json.data <- json.data$books[c(1,2,7,8,3,4,5,6)] #fix column order
print(json.data)
##                                                                     title
## 1                                 Urban Utopias in the Twentieth Century
## 2                                        The Citizen's Guide to Planning
## 3 The Death and Life of Great American Cities (50th Anniversary Edition)
##                    author1          author2            author3
## 1          Fishman, Robert             <NA>               <NA>
## 2 Duerksen, Christopher J. Dale, C. Gregory Elliott, Donald L.
## 3             Jacobs, Jane             <NA>               <NA>
##          subject        isbn-13          publisher year_published
## 1 Urban Planning 978-0262560238      The MIT Press           1982
## 2 Urban Planning 978-1932364651 APA Planners Press           2009
## 3 Urban Planning 978-0679644330     Modern Library           2011
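The $books step is needed because fromJSON() returns a list for the top-level JSON object, and the array of records inside it is what gets simplified to a data frame. Here is a small sketch with made-up JSON text shaped like what books.json appears to contain. The actual file may encode missing authors differently, for example as null:

```r
library(jsonlite)

# Hypothetical JSON text: a "books" array of records, one without author2
txt <- '{"books": [
  {"author1": "Fishman, Robert", "publisher": "The MIT Press"},
  {"author1": "Duerksen, Christopher J.", "author2": "Dale, C. Gregory",
   "publisher": "APA Planners Press"}
]}'

parsed <- fromJSON(txt)

class(parsed)                   # the top-level object is a list
class(parsed$books)             # the array of records is a data frame
is.na(parsed$books$author2[1])  # a key missing from a record is filled with NA
```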
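The three data frames hold the same content after being read from the web, but the columns from the XML and JSON files come back in a different order than in the HTML table, so I reordered them by index. Missing authors also print differently: blank in the HTML version, <NA> in the XML and JSON versions. The XML package's xmlToDataFrame() returns a data frame directly, while html_table() returns a list of tables and fromJSON() returns a list whose books element holds the data, so rvest and jsonlite each need one extra step to get to a data frame.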
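One way to check that the results really match is to normalize them before comparing. Here is a rough sketch on two tiny made-up data frames. The helper normalize() is hypothetical and only handles letter case in the column names and blank versus NA values, not differences such as ISBN.13 versus isbn-13:

```r
# Made-up stand-ins for the HTML and XML results (two columns each)
from_html <- data.frame(Title = c("A", "B"),
                        Author2 = c("", "Dale, C. Gregory"),
                        stringsAsFactors = FALSE)
from_xml  <- data.frame(title = c("A", "B"),
                        author2 = c(NA, "Dale, C. Gregory"),
                        stringsAsFactors = FALSE)

# Hypothetical helper: lower-case the column names and treat "" as missing
normalize <- function(df) {
  names(df) <- tolower(names(df))
  df[] <- lapply(df, function(col) {
    col <- as.character(col)
    col[!is.na(col) & col == ""] <- NA
    col
  })
  df
}

all.equal(normalize(from_html), normalize(from_xml))
```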