library(tidyverse)
library(RCurl)
library(XML)
library(jsonlite)
library(rvest)
library(xml2)

Overview

We were asked to pick three of our favorite books on one of our favorite subjects.
Basic requirements were:
  • At least one of the books should have more than one author.
  • For each book, include the title, authors, and two or three other attributes that we find interesting.
  • Take the information that we’ve selected, and separately create three files which store the books’ information in:
    • HTML (using an HTML table)
    • XML
    • JSON
Write R code, using our packages of choice, to load the information from each of the three sources into separate R data frames.

Question
Are the three data frames identical?
Deliverable
Three source files and R code.

Steps


1. Select books:

| Rank | Title | Author(s) | Year Pub | Topic(s) |
|:---:|:---:|:---:|:---:|:---:|
| 1 | Diet for a New America | John Robbins | 1987 | diet, health, vegetarian, vegan, animal rights |
| 2 | The Third Industrial Revolution | Jeremy Rifkin | 2011 | economics, renewable energy, new energy regime, lateral thinking, digital revolution |
| 3 | Another Economy is Possible | Manuel Castells, Sarah Banet-Weiser, Sviatlana Hlebik, Giorgos Kallis, Sarah Pink, Kirsten Seale, Lisa J. Servon, Lana Swartz, Angelos Varvarousis | 2008 | economics, sharing economy, alternative economic practices, cooperatives, barter networks |

2. Create files in each format: this was done via the RStudio IDE.
3. Host files: GitHub was chosen to host each file.
4. Import in R
  • Write code to import each file into a valid data frame format in R
5. Conclusion

Load data

books_html <- "https://raw.githubusercontent.com/justinm0rgan/data607/main/Assignments/wk7/books.html"
books_xml <- "https://raw.githubusercontent.com/justinm0rgan/data607/main/Assignments/wk7/books.xml"
books_json <- "https://raw.githubusercontent.com/justinm0rgan/data607/main/Assignments/wk7/books.json"

HTML

# download the contents of the HTML file from its URL
books_html <- getURL(books_html)

# parse the HTML table; readHTMLTable() returns a list with one data frame per table
df_html <- books_html %>% 
  readHTMLTable()

df_html
## $`NULL`
##   Rank                           Title
## 1    1          Diet for a New America
## 2    2 The Third Industrial Revolution
## 3    3     Another Economy is Possible
##                                                                                                                                            Author(s)
## 1                                                                                                                                       John Robbins
## 2                                                                                                                                      Jeremy Rifkin
## 3 Manuel castells, Sarah Banet-Weiser, Sviatlana Hlebik, Giorgos Kallis, Sarah Pink, Kirsten Seale, Lisa J. Servon, Lana Swartz, Angelos Varvarousis
##   Year Pub
## 1     1987
## 2     2011
## 3     2008
##                                                                                    Topic(s)
## 1                                            diet, health, vegetarian, vegan, animal rights
## 2      economics, renewable energy, new energy regime, lateral thinking, digital revolution
## 3 economics, sharing economy, alternative economic practices, cooperatives, barter networks
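
The `$`NULL`` header in the output above shows that `readHTMLTable()` returned a list holding the data frame rather than the data frame itself. As an alternative, here is a minimal sketch using rvest, which is loaded above but not otherwise used. The inline `html_txt` string is a hypothetical, shortened stand-in for books.html, not the real file, and `df_html_rvest` is a name made up for this illustration.

```r
library(rvest)

# Hypothetical inline stand-in for books.html (assumed structure, shortened)
html_txt <- '
<table>
  <tr><th>Rank</th><th>Title</th><th>Year Pub</th></tr>
  <tr><td>1</td><td>Diet for a New America</td><td>1987</td></tr>
  <tr><td>2</td><td>The Third Industrial Revolution</td><td>2011</td></tr>
</table>'

# parse the markup, pick the first <table>, and convert it to a data frame
df_html_rvest <- html_txt %>%
  read_html() %>%
  html_element("table") %>%
  html_table()

df_html_rvest
```

Because `html_element()` selects a single table, `html_table()` returns one data frame directly, so no list indexing is needed afterwards.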

XML

# download the contents of the XML file from its URL
books_xml <- getURL(books_xml)

# get authors
books_xml %>%
  read_xml %>% 
  xml_find_all(xpath = "//book//author") %>% 
  xml_text()
##  [1] "John Robbins"        "Jeremy Rifken"       "Manuel Castells"    
##  [4] "Sarah Banet-Weiser"  "Sviatlana Hlebik"    "Giorgos Kallis"     
##  [7] "Sarah Pink"          "Kirsten Seale"       "Lisa J. Servon"     
## [10] "Lana Swartz"         "Angelos Varvarousis"
# get topics
books_xml %>% 
  read_xml %>% 
  xml_find_all(xpath = '//topic') %>% 
  xml_text()
##  [1] "diet"                           "vegetarian"                    
##  [3] "vegan"                          "animal rights"                 
##  [5] "economics"                      "renewable energy"              
##  [7] "new energy regime"              "lateral thinking"              
##  [9] "digital revolution"             "economics"                     
## [11] "sharing economy"                "alternative economic practices"
## [13] "cooperatives"                   "barter networks"
# get topics again with the XML package; this returns the nodes rather than their text
books_xml %>% 
  xmlParse() %>% 
  xpathSApply(path = '//book//topic')
## [[1]]
## <topic id="1">diet</topic> 
## 
## [[2]]
## <topic id="2">vegetarian</topic> 
## 
## [[3]]
## <topic id="3">vegan</topic> 
## 
## [[4]]
## <topic id="4">animal rights</topic> 
## 
## [[5]]
## <topic id="1">economics</topic> 
## 
## [[6]]
## <topic id="2">renewable energy</topic> 
## 
## [[7]]
## <topic id="3">new energy regime</topic> 
## 
## [[8]]
## <topic id="4">lateral thinking</topic> 
## 
## [[9]]
## <topic id="5">digital revolution</topic> 
## 
## [[10]]
## <topic id="1">economics</topic> 
## 
## [[11]]
## <topic id="2">sharing economy</topic> 
## 
## [[12]]
## <topic id="3">alternative economic practices</topic> 
## 
## [[13]]
## <topic id="4">cooperatives</topic> 
## 
## [[14]]
## <topic id="5">barter networks</topic>
# get only the first author of each book
books_xml %>% 
  xmlParse() %>% 
  xpathSApply('//book/author[position()=1]')
## [[1]]
## <author id="1">John Robbins</author> 
## 
## [[2]]
## <author id="1">Jeremy Rifken</author> 
## 
## [[3]]
## <author id="1">Manuel Castells</author>
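# parse the XML once so it can be queried repeatedly
books_parsed <- xmlParse(books_xml)
# build char vector with book names
books <- c("Diet for a New America", "The Third Industrial Revolution",
           "Another Economy is Possible")

# build one query per title; note these are not valid XPath,
# because the titles are element text, not element names
(expQuery <- sprintf("//%s/book", books))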
## [1] "//Diet for a New America/book"         
## [2] "//The Third Industrial Revolution/book"
## [3] "//Another Economy is Possible/book"
# helper: return a node's text together with the name of its parent element
getAuthor <- function(node) {
  value <- xmlValue(node)
  book <- xmlName(xmlParent(node))
  c(book = book, value = value)
}

#as.data.frame(t(xpathSApply(books_parsed,expQuery, getAuthor)))
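
One way the XML could be turned into a data frame is to loop over the `<book>` nodes and collapse the repeated `<author>` and `<topic>` children into single strings, as was done for JSON. The sketch below is only an illustration under assumptions: `<book>`, `<author>`, and `<topic>` appear in the XPath queries above, but `<books>`, `<rank>`, `<title>`, and `<year>` are assumed element names, and `xml_txt` is a hypothetical, shortened stand-in for books.xml. The helper `collapse_children()` is also made up for this example.

```r
library(xml2)

# Hypothetical stand-in for books.xml; <rank>, <title>, <year>, and the
# <books> root are assumed names, not confirmed by the real file
xml_txt <- '
<books>
  <book>
    <rank>1</rank>
    <title>Diet for a New America</title>
    <author id="1">John Robbins</author>
    <year>1987</year>
    <topic id="1">diet</topic>
    <topic id="2">vegan</topic>
  </book>
  <book>
    <rank>3</rank>
    <title>Another Economy is Possible</title>
    <author id="1">Manuel Castells</author>
    <author id="2">Sarah Banet-Weiser</author>
    <year>2008</year>
    <topic id="1">economics</topic>
  </book>
</books>'

book_nodes <- xml_find_all(read_xml(xml_txt), "//book")

# collapse every <tag> child of one book node into a comma-separated string
collapse_children <- function(node, tag) {
  paste(xml_text(xml_find_all(node, paste0("./", tag))), collapse = ", ")
}

# one row per book: single-valued fields come from xml_find_first(),
# repeated fields are collapsed book by book
df_xml <- data.frame(
  rank   = as.integer(xml_text(xml_find_first(book_nodes, "./rank"))),
  title  = xml_text(xml_find_first(book_nodes, "./title")),
  author = vapply(book_nodes, collapse_children, character(1), tag = "author"),
  year   = as.integer(xml_text(xml_find_first(book_nodes, "./year"))),
  topic  = vapply(book_nodes, collapse_children, character(1), tag = "topic"),
  stringsAsFactors = FALSE
)

df_xml
```

Querying relative to each book node (`./author`) rather than the whole document (`//author`) is what keeps authors and topics attached to the right book.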

JSON

# download the contents of the JSON file from its URL
books_json <- getURL(books_json)

# create df from json file
books_json_df <- books_json %>% 
  fromJSON() %>% 
  as.data.frame() %>% 
  # strip the 'books.' prefix that as.data.frame() adds to each column name
  rename_with(~ str_replace(.x, 'books\\.', '')) %>% 
  # collapse the author and topic list-columns into comma-separated strings
  mutate(
    author = unlist(lapply(author, 
                           function(x) str_c(x, collapse =', ' ))),
    topic = unlist(lapply(topic,
                          function(x) str_c(x, collapse = ', '))))
books_json_df
##   rank                           title
## 1    1          Diet for a New America
## 2    2 The Third Industrial Revolution
## 3    3     Another Economy is Possible
##                                                                                                                                               author
## 1                                                                                                                                       John Robbins
## 2                                                                                                                                      Jeremy Rifken
## 3 Manuel Castells, Sarah Banet-Weiser, Sviatlana Hlebik, Giorgos Kallis, Sarah Pink, Kirsten Seale, Lisa J. Servon, Lana Swartz, Angelos Varvarousis
##   year
## 1 1987
## 2 2011
## 3 2008
##                                                                                       topic
## 1                                            diet, health, vegetarian, vegan, animal rights
## 2      economics, renewable energy, new energy regime, lateral thinking, digital revolution
## 3 economics, sharing economy, alternative economic practices, cooperatives, barter networks

Conclusion

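The HTML and JSON files both loaded into data frames with 5 columns. JSON took a bit more effort, because the author and topic list-columns had to be collapsed into strings. The two data frames are not identical: the column names differ (Rank, Title, Author(s), Year Pub, Topic(s) versus rank, title, author, year, topic), and the author spellings differ between the source files (“Manuel castells” and “Jeremy Rifkin” in HTML versus “Manuel Castells” and “Jeremy Rifken” in JSON). I was unable to convert the XML file into a data frame. I tried the technique taught in the text, but couldn’t quite get it right.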
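
The question of whether the data frames are identical can also be checked in code. The following is a minimal sketch, not the comparison actually run for this assignment: `df_a` and `df_b` are hypothetical two-row stand-ins for the HTML and JSON results, and the assumption that the HTML `Rank` column arrives as text while the JSON `rank` is an integer is mine.

```r
# Hypothetical stand-ins for the HTML (df_a) and JSON (df_b) data frames;
# column types are assumed for illustration
df_a <- data.frame(Rank = c("1", "2"),
                   Title = c("Diet for a New America",
                             "The Third Industrial Revolution"),
                   `Author(s)` = c("John Robbins", "Jeremy Rifkin"),
                   check.names = FALSE, stringsAsFactors = FALSE)
df_b <- data.frame(rank = c(1L, 2L),
                   title = c("Diet for a New America",
                             "The Third Industrial Revolution"),
                   author = c("John Robbins", "Jeremy Rifken"),
                   stringsAsFactors = FALSE)

# strict check: names, types, and values must all match
identical(df_a, df_b)

# harmonise column names and types, then compare the values only
df_a_norm <- setNames(df_a, names(df_b))
df_a_norm$rank <- as.integer(df_a_norm$rank)
all.equal(df_a_norm, df_b)

# locate the cells that still differ (row and column index)
which(df_a_norm != df_b, arr.ind = TRUE)
```

`identical()` fails on any difference at all, so normalising names and types first helps separate cosmetic differences from real disagreements in the data, such as a misspelled author name.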