Assignment 7 – Working with XML and JSON in R

HTML, JSON and XML

This assignment reads three files, each in a different format (HTML, XML and JSON), into an R Markdown document. The files were written by hand for this exercise, although similar data probably exists on the web in raw form.

The chosen books are below:

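library(tibble)      # tibble()
library(knitr)       # kable()
library(kableExtra)  # kable_styling() and the %>% pipe

# One row per book, with the file format it is stored in
chosen_books <- tibble(
  Title = c("Take the Cannoli: Stories from the New World", "Gig: Americans Talk about Their Jobs", "Kitchen Confidential: Adventures in the Culinary Underbelly"),
  Author = c("Sarah Vowell", "John Bowe, Marisa Bowe & Sabin Streeter", "Anthony Bourdain"),
  Format = c("HTML", "XML", "JSON")
)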

kable(chosen_books) %>%
  kable_styling(latex_options = "scale_down")

Title	Author	Format
Take the Cannoli: Stories from the New World	Sarah Vowell	HTML
Gig: Americans Talk about Their Jobs	John Bowe, Marisa Bowe & Sabin Streeter	XML
Kitchen Confidential: Adventures in the Culinary Underbelly	Anthony Bourdain	JSON

Each book will have the following attributes:

Title
Author(s)
ISBN
Pages
Publisher
Date

HTML - Take the Cannoli: Stories from the New World

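This section loads the textreadr package to parse the HTML. Note that html_elements() and html_table(), which do the actual table extraction, are rvest functions.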

library(rvest)  # provides read_html(), html_elements() and html_table()

# Read the HTML from the GitHub repository
url_html <- "https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/week7/take_the_cannoli.html"
books_html <- read_html(url_html)

# Select the element with id "books" using the CSS selector "#books",
# then pipe it to html_table(), which returns a list of tables
book_html <- books_html %>%
  html_elements("#books") %>%
  html_table()


kable(book_html) %>%
  kable_styling(latex_options = "scale_down")

Title	Author	ISBN	Pages	Publisher	Date
Take the Cannoli: Stories from the New World	Sarah Vowell	0684867974, 0743205405	219 pages : illustrations ; 25 cm	New York : Simon & Schuster	2000

# Detach textreadr because it conflicts with the other XML readers used below
detach("package:textreadr", unload=TRUE)
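The table extraction above can be tried without network access. The following is a minimal sketch, assuming rvest 1.0 or later is installed; the inline markup in `sample_html` is a hypothetical stand-in, since the real take_the_cannoli.html may be laid out differently.

```r
library(rvest)

# Hypothetical page; only the id "books" is taken from the selector used above
sample_html <- '
<html><body>
  <table id="books">
    <tr><th>Title</th><th>Author</th><th>Date</th></tr>
    <tr><td>Take the Cannoli: Stories from the New World</td><td>Sarah Vowell</td><td>2000</td></tr>
  </table>
</body></html>'

# read_html() accepts a literal HTML string as well as a URL
page <- read_html(sample_html)

# html_element() (singular) returns a single node, so html_table()
# gives back one tibble rather than a list of tibbles
book_tbl <- page %>%
  html_element("#books") %>%
  html_table()

print(book_tbl)
```

Using html_element() instead of html_elements() is a design choice for this sketch: with a unique id there is only one table to return, so a single tibble is more convenient to pass on to kable().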

XML - Gig: Americans Talk about Their Jobs

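Using the XML and xml2 packages, we can parse XML documents. In this case there are multiple authors, and xmlToDataFrame() collapses their names into a single string with no separator, as the output below shows.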

library(XML)
library(xml2)

book_url <- 'https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/week7/gig.xml'

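# Download and parse the file with xml2, then hand it to the XML package
data <- read_xml(book_url)
doc <- xmlParse(data)
# Convert every <book> node into one data frame row
df <- xmlToDataFrame(nodes = getNodeSet(doc, "//book"))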
kable(df) %>%
  kable_styling(latex_options = "scale_down")

title	authors	isbn	pages	publisher	date
Gig: Americans Talk about Their Jobs	Marisa BoweJohn BoweSabin Streeter	0609807072	672 pages	Three Rivers Press	2001
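The run-together author names in the output above can be avoided by reading each author node separately. The following is a minimal sketch using only xml2; the inline document in `gig_xml` and its `<authors>`/`<author>` nesting are assumptions, since the structure of the real gig.xml is not shown here.

```r
library(xml2)

# Hypothetical document; element names below "book" are assumed
gig_xml <- '<?xml version="1.0" encoding="UTF-8"?>
<books>
  <book>
    <title>Gig: Americans Talk about Their Jobs</title>
    <authors>
      <author>Marisa Bowe</author>
      <author>John Bowe</author>
      <author>Sabin Streeter</author>
    </authors>
  </book>
</books>'

doc <- read_xml(gig_xml)
book <- xml_find_first(doc, "//book")

# Pull each author as its own string instead of one concatenated value
author_names <- xml_text(xml_find_all(book, "./authors/author"))

# Join the names with an explicit separator before building the row
book_df <- data.frame(
  title = xml_text(xml_find_first(book, "./title")),
  authors = paste(author_names, collapse = ", "),
  stringsAsFactors = FALSE
)

print(book_df)
```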

JSON - Kitchen Confidential: Adventures in the Culinary Underbelly

We parse a JSON document containing multiple ISBNs.

Note: because the book has two ISBNs, the resulting data frame has two rows.

book_json <- as_tibble(jsonlite::fromJSON("https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/week7/kitchen_confidential.json"))

kable(book_json) %>%
  kable_styling(latex_options = "scale_down")

title	author	isbn	pages	publisher	date
Kitchen Confidential: Adventures in the Culinary Underbelly	Anthony Bourdain	0060899220	312, 22 pages	Harper Perennial	2007
Kitchen Confidential: Adventures in the Culinary Underbelly	Anthony Bourdain	9780060899226	312, 22 pages	Harper Perennial	2007
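The duplicated row comes from how a list with one multi-valued field is turned into a table. The following is a minimal sketch, assuming the JSON stores the ISBNs as an array inside a single object; `sample_json` is a hypothetical stand-in for kitchen_confidential.json, whose actual layout is not shown here.

```r
library(jsonlite)
library(tibble)

# Hypothetical document with the two ISBNs stored as an array
sample_json <- '{
  "title": "Kitchen Confidential: Adventures in the Culinary Underbelly",
  "author": "Anthony Bourdain",
  "isbn": ["0060899220", "9780060899226"]
}'

parsed <- fromJSON(sample_json)

# as_tibble() recycles the length-one fields to match the two ISBNs,
# which is why the book appears on two rows
two_rows <- as_tibble(parsed)

# Collapsing the ISBNs into one string keeps the book on a single row
parsed$isbn <- paste(parsed$isbn, collapse = ", ")
one_row <- as_tibble(parsed)

print(two_rows)
print(one_row)
```

Whether one row or two is preferable depends on the later analysis: two rows keep each ISBN as a separate value, while one row keeps one record per book.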

Cliff Lee

10/10/2021
