library(tidyverse)
library(RCurl)
library(XML)
library(jsonlite)
library(rvest)
library(xml2)
We were asked to pick three of our favorite books on one of our favorite subjects.
Basic requirements were:
- At least one of the books should have more than one author.
- For each book, include the title, authors, and two or three other attributes that we find interesting.
- Take the information that we’ve selected, and separately create three files which store the books’ information in:
  - HTML (using an HTML table)
  - XML
  - JSON
Write R code, using our packages of choice, to load the information from each of the three sources into separate R data frames.
Question
Are the three data frames identical?
Deliverable
Three source files and R code.
1. Select books:

| Rank | Title | Author(s) | Year Pub | Topic(s) |
|:---:|:---:|:---:|:---:|:---:|
| 1 | Diet for a New America | John Robbins | 1987 | diet, health, vegetarian, vegan, animal rights |
| 2 | The Third Industrial Revolution | Jeremy Rifkin | 2011 | economics, renewable energy, new energy regime, lateral thinking, digital revolution |
| 3 | Another Economy is Possible | Manuel Castells, Sarah Banet-Weiser, Sviatlana Hlebik, Giorgos Kallis, Sarah Pink, Kirsten Seale, Lisa J. Servon, Lana Swartz, Angelos Varvarousis | 2008 | economics, sharing economy, alternative economic practices, cooperatives, barter networks |
2. Create files in each format
   - This was done in the RStudio IDE.
3. Host files
   - GitHub was chosen to host each file.
books_html <- "https://raw.githubusercontent.com/justinm0rgan/data607/main/Assignments/wk7/books.html"
books_xml <- "https://raw.githubusercontent.com/justinm0rgan/data607/main/Assignments/wk7/books.xml"
books_json <- "https://raw.githubusercontent.com/justinm0rgan/data607/main/Assignments/wk7/books.json"
# download the contents of the HTML file
books_html <- getURL(books_html)
# parse every table in the page; readHTMLTable() returns a list of data frames
df_html <- books_html %>%
  readHTMLTable()
df_html
## $`NULL`
## Rank Title
## 1 1 Diet for a New America
## 2 2 The Third Industrial Revolution
## 3 3 Another Economy is Possible
## Author(s)
## 1 John Robbins
## 2 Jeremy Rifkin
## 3 Manuel castells, Sarah Banet-Weiser, Sviatlana Hlebik, Giorgos Kallis, Sarah Pink, Kirsten Seale, Lisa J. Servon, Lana Swartz, Angelos Varvarousis
## Year Pub
## 1 1987
## 2 2011
## 3 2008
## Topic(s)
## 1 diet, health, vegetarian, vegan, animal rights
## 2 economics, renewable energy, new energy regime, lateral thinking, digital revolution
## 3 economics, sharing economy, alternative economic practices, cooperatives, barter networks
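The `$`NULL`` header in the output above shows that `df_html` is a one-element list holding the table, not a data frame itself, so `df_html[[1]]` would pull the data frame out. As a minimal sketch of another route, the block below swaps in `rvest::html_table()` for `readHTMLTable()` and selects the first table explicitly. The inline `sample_html` string is a hypothetical stand-in for the hosted file, trimmed to two columns.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for books.html (two columns only, for brevity)
sample_html <- '<table>
  <tr><th>Rank</th><th>Title</th></tr>
  <tr><td>1</td><td>Diet for a New America</td></tr>
  <tr><td>2</td><td>The Third Industrial Revolution</td></tr>
</table>'

# html_table() on a whole document returns a list with one entry per <table>
tables <- html_table(read_html(sample_html))

# keep the first (and only) table as a plain data frame
df_html_single <- as.data.frame(tables[[1]])
str(df_html_single)
```

Selecting the table by position keeps the result a single data frame, which makes later comparisons with the JSON result more direct.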
# download the contents of the XML file
books_xml <- getURL(books_xml)
# get authors
books_xml %>%
read_xml %>%
xml_find_all(xpath = "//book//author") %>%
xml_text()
## [1] "John Robbins" "Jeremy Rifken" "Manuel Castells"
## [4] "Sarah Banet-Weiser" "Sviatlana Hlebik" "Giorgos Kallis"
## [7] "Sarah Pink" "Kirsten Seale" "Lisa J. Servon"
## [10] "Lana Swartz" "Angelos Varvarousis"
# get topics
books_xml %>%
read_xml %>%
xml_find_all(xpath = '//topic') %>%
xml_text()
## [1] "diet" "vegetarian"
## [3] "vegan" "animal rights"
## [5] "economics" "renewable energy"
## [7] "new energy regime" "lateral thinking"
## [9] "digital revolution" "economics"
## [11] "sharing economy" "alternative economic practices"
## [13] "cooperatives" "barter networks"
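# same topics via the XML package; without a function argument,
# xpathSApply() returns the matching nodes rather than their text
books_xml %>%
  xmlParse() %>%
  xpathSApply(path = '//book//topic')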
## [[1]]
## <topic id="1">diet</topic>
##
## [[2]]
## <topic id="2">vegetarian</topic>
##
## [[3]]
## <topic id="3">vegan</topic>
##
## [[4]]
## <topic id="4">animal rights</topic>
##
## [[5]]
## <topic id="1">economics</topic>
##
## [[6]]
## <topic id="2">renewable energy</topic>
##
## [[7]]
## <topic id="3">new energy regime</topic>
##
## [[8]]
## <topic id="4">lateral thinking</topic>
##
## [[9]]
## <topic id="5">digital revolution</topic>
##
## [[10]]
## <topic id="1">economics</topic>
##
## [[11]]
## <topic id="2">sharing economy</topic>
##
## [[12]]
## <topic id="3">alternative economic practices</topic>
##
## [[13]]
## <topic id="4">cooperatives</topic>
##
## [[14]]
## <topic id="5">barter networks</topic>
# first author of each book
books_xml %>%
  xmlParse() %>%
  xpathSApply('//book/author[position()=1]')
## [[1]]
## <author id="1">John Robbins</author>
##
## [[2]]
## <author id="1">Jeremy Rifken</author>
##
## [[3]]
## <author id="1">Manuel Castells</author>
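books_parsed <- xmlParse(books_xml)
# build char vector with book names
books <- c("Diet for a New America", "The Third Industrial Revolution",
           "Another Economy is Possible")
# build one query per title; note that XML element names cannot contain
# spaces, so these strings are not valid XPath expressions
(expQuery <- sprintf("//%s/book", books))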
## [1] "//Diet for a New America/book"
## [2] "//The Third Industrial Revolution/book"
## [3] "//Another Economy is Possible/book"
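# return the parent element name and the text value for one node
getAuthor <- function(node) {
  value <- xmlValue(node)
  book <- xmlName(xmlParent(node))
  # use the parent name computed above, not the global `books` vector
  c(book = book, value = value)
}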
#as.data.frame(t(xpathSApply(books_parsed,expQuery, getAuthor)))
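One possible way to get from the XML file to a data frame is to loop over the `book` nodes with xml2 and collapse the repeated `author` and `topic` children into one string per book, much as the JSON section does. This is only a sketch under assumptions: the queries above confirm the `book`, `author`, and `topic` elements, but the `books` root and the `rank`, `title`, and `year` element names are guesses about the file, and `sample_xml` and `collapse_children()` are hypothetical names introduced for illustration.

```r
library(xml2)

# Hypothetical sample mirroring the assumed layout of books.xml.
# <books>, <rank>, <title>, and <year> are assumed element names.
sample_xml <- '<books>
  <book>
    <rank>1</rank>
    <title>Diet for a New America</title>
    <author id="1">John Robbins</author>
    <year>1987</year>
    <topic id="1">diet</topic>
    <topic id="2">vegan</topic>
  </book>
  <book>
    <rank>3</rank>
    <title>Another Economy is Possible</title>
    <author id="1">Manuel Castells</author>
    <author id="2">Sarah Banet-Weiser</author>
    <year>2008</year>
    <topic id="1">economics</topic>
  </book>
</books>'

# one node per book
book_nodes <- xml_find_all(read_xml(sample_xml), "//book")

# hypothetical helper: join every match of a relative XPath under one book
collapse_children <- function(node, xpath) {
  paste(xml_text(xml_find_all(node, xpath)), collapse = ", ")
}

# one row per book; repeated children become comma-separated strings
df_xml <- data.frame(
  rank   = as.integer(xml_text(xml_find_first(book_nodes, "./rank"))),
  title  = xml_text(xml_find_first(book_nodes, "./title")),
  author = vapply(book_nodes, collapse_children, character(1),
                  xpath = ".//author"),
  year   = as.integer(xml_text(xml_find_first(book_nodes, "./year"))),
  topic  = vapply(book_nodes, collapse_children, character(1),
                  xpath = ".//topic"),
  stringsAsFactors = FALSE
)
df_xml
```

Working book by book keeps each author and topic attached to its title, which the flat `//book//author` and `//topic` queries do not. If the real file uses different element names, only the XPath strings would need to change.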
# download the contents of the JSON file
books_json <- getURL(books_json)
# create df from json file
books_json_df <- books_json %>%
  fromJSON() %>%
  as.data.frame() %>%
  # strip the "books." prefix from the column names
  # (rename_with() replaces the deprecated rename_all(funs(...)))
  rename_with(~ str_replace(.x, 'books\\.', '')) %>%
  # collapse the list-columns into comma-separated strings
  mutate(
    author = unlist(lapply(author,
                           function(x) str_c(x, collapse = ', '))),
    topic = unlist(lapply(topic,
                          function(x) str_c(x, collapse = ', '))))
books_json_df
## rank title
## 1 1 Diet for a New America
## 2 2 The Third Industrial Revolution
## 3 3 Another Economy is Possible
## author
## 1 John Robbins
## 2 Jeremy Rifken
## 3 Manuel Castells, Sarah Banet-Weiser, Sviatlana Hlebik, Giorgos Kallis, Sarah Pink, Kirsten Seale, Lisa J. Servon, Lana Swartz, Angelos Varvarousis
## year
## 1 1987
## 2 2011
## 3 2008
## topic
## 1 diet, health, vegetarian, vegan, animal rights
## 2 economics, renewable energy, new energy regime, lateral thinking, digital revolution
## 3 economics, sharing economy, alternative economic practices, cooperatives, barter networks
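The HTML and JSON data frames both have three rows and five columns, but they are not identical. The column names differ (for example, Author(s) versus author), and the source files spell two author names differently: Jeremy Rifkin and Manuel castells in the HTML file, Jeremy Rifken and Manuel Castells in the JSON file. The JSON file took a bit more effort because the author and topic fields had to be collapsed into single strings. I was unable to convert the XML file into a data frame; I tried the technique taught in the text, but could not get it to work.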
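To check the question of whether two data frames are identical in code rather than by eye, base R offers `identical()` and `all.equal()`. The sketch below uses two small hypothetical stand-ins, `df_a` and `df_b`, that copy one column-name difference and one spelling difference from the outputs above. They are not the real `df_html` and `books_json_df` objects.

```r
# Hypothetical stand-ins for the HTML and JSON results
df_a <- data.frame(
  "Rank" = 1:2,
  "Author(s)" = c("John Robbins", "Jeremy Rifkin"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
df_b <- data.frame(
  rank = 1:2,
  author = c("John Robbins", "Jeremy Rifken"),
  stringsAsFactors = FALSE
)

# strict comparison: FALSE, since the names and one value differ
identical(df_a, df_b)

# harmonize the column names, then compare again
names(df_a) <- names(df_b)
identical(df_a, df_b)   # still FALSE because of the spelling difference

# all.equal() describes what differs instead of only returning FALSE
all.equal(df_a, df_b)

# locate the mismatching cells by row and column index
mismatch <- which(df_a != df_b, arr.ind = TRUE)
mismatch
```

Harmonizing the names first separates cosmetic differences from real differences in the data, so any remaining mismatches point back to the source files.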