HTML, JSON and XML

This assignment reads three files, each in a different format (HTML, XML, and JSON), into an R Markdown document. The files were written by hand for this exercise, although equivalent data probably already exists on the web in raw form.

The chosen books are below:

# Packages used to build and render the summary table
library(tibble)
library(knitr)
library(kableExtra)

chosen_books <- tibble(
  Title  = c("Take the Cannoli: Stories from the New World", "Gig: Americans Talk about Their Jobs", "Kitchen Confidential: Adventures in the Culinary Underbelly"),
  Author = c("Sarah Vowell", "John Bowe, Marisa Bowe & Sabin Streeter", "Anthony Bourdain"),
  Format = c("HTML", "XML", "JSON")
)

kable(chosen_books) %>%
  kable_styling(latex_options = "scale_down")
| Title | Author | Format |
|---|---|---|
| Take the Cannoli: Stories from the New World | Sarah Vowell | HTML |
| Gig: Americans Talk about Their Jobs | John Bowe, Marisa Bowe & Sabin Streeter | XML |
| Kitchen Confidential: Adventures in the Culinary Underbelly | Anthony Bourdain | JSON |

Each book has the following attributes: title, author(s), ISBN, pages, publisher, and date.

HTML - Take the Cannoli: Stories from the New World

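This section parses the HTML with the rvest package (html_elements() and html_table()). The textreadr package is also loaded in this session and is detached at the end of the section.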

library(rvest)

# Read the HTML from the GitHub repository
url_html <- "https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/week7/take_the_cannoli.html"
books_html <- read_html(url_html)

# Select the element with id "books" using a CSS selector ("#books"),
# then pipe the result to html_table()
book_html <- books_html %>%
  html_elements("#books") %>%
  html_table()


kable(book_html) %>%
  kable_styling(latex_options = "scale_down")
| Title | Author | ISBN | Pages | Publisher | Date |
|---|---|---|---|---|---|
| Take the Cannoli: Stories from the New World | Sarah Vowell | 0684867974, 0743205405 | 219 pages : illustrations ; 25 cm | New York : Simon & Schuster | 2000 |
# Detach textreadr because it conflicts with other XML readers
detach("package:textreadr", unload=TRUE)
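
As a minimal sketch of the select-then-convert step used in this section, the example below parses a small inline HTML string instead of the GitHub file. The markup is a hypothetical stand-in (the real file's layout is assumed, not copied). It uses html_element() (singular), which returns only the first match, so html_table() yields a single tibble rather than a list of tibbles.

```r
library(rvest)

# Hypothetical stand-in for take_the_cannoli.html; only the id "books" is taken from the text
page <- minimal_html('
  <table id="books">
    <tr><th>Title</th><th>Author</th></tr>
    <tr><td>Take the Cannoli: Stories from the New World</td><td>Sarah Vowell</td></tr>
  </table>')

# "#books" is a CSS selector matching the element whose id attribute is "books"
book_tbl <- page %>%
  html_element("#books") %>%
  html_table()

print(book_tbl)
```

With html_elements() (plural), the same pipeline would return a list, and the table would be its first element.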

XML - Gig: Americans Talk about Their Jobs

Using the XML and xml2 packages, we can parse XML documents. In this case, there are multiple authors.

library(XML)
library(xml2)

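book_url <- 'https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/week7/gig.xml'

# Read the XML from the GitHub repository with xml2, then hand it to the XML package
data <- read_xml(book_url)
doc <- xmlParse(data)

# Convert each <book> node into one data frame row
df <- xmlToDataFrame(nodes = getNodeSet(doc, "//book"))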
kable(df) %>%
  kable_styling(latex_options = "scale_down")
| title | authors | isbn | pages | publisher | date |
|---|---|---|---|---|---|
| Gig: Americans Talk about Their Jobs | Marisa BoweJohn BoweSabin Streeter | 0609807072 | 672 pages | Three Rivers Press | 2001 |
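
The three author names run together in the authors column because xmlToDataFrame() takes the combined text of each child of the book node, so nested author elements are pasted with no separator. One way to keep them apart is to extract the author nodes individually with xml2 and join them with a delimiter. The sketch below uses a small inline XML string; the element names (books, book, authors, author) are assumptions about how gig.xml is laid out, not a copy of the file.

```r
library(xml2)

# Hypothetical stand-in for gig.xml; the element names are assumed
xml_src <- '<books>
  <book>
    <title>Gig: Americans Talk about Their Jobs</title>
    <authors>
      <author>Marisa Bowe</author>
      <author>John Bowe</author>
      <author>Sabin Streeter</author>
    </authors>
    <isbn>0609807072</isbn>
  </book>
</books>'

doc  <- read_xml(xml_src)
book <- xml_find_first(doc, "//book")

# Pull each <author> node separately instead of the combined text of <authors>
authors <- xml_text(xml_find_all(book, "./authors/author"))

book_df <- data.frame(
  title   = xml_text(xml_find_first(book, "./title")),
  authors = paste(authors, collapse = ", "),  # join names with a visible delimiter
  isbn    = xml_text(xml_find_first(book, "./isbn")),
  stringsAsFactors = FALSE
)

print(book_df)
```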

JSON - Kitchen Confidential: Adventures in the Culinary Underbelly

We parse a JSON document containing multiple ISBNs.

Note: because the book has two ISBNs, the resulting data frame has two rows, one per ISBN.

book_json <- as_tibble(jsonlite::fromJSON("https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/week7/kitchen_confidential.json"))

kable(book_json) %>%
  kable_styling(latex_options = "scale_down")
| title | author | isbn | pages | publisher | date |
|---|---|---|---|---|---|
| Kitchen Confidential: Adventures in the Culinary Underbelly | Anthony Bourdain | 0060899220 | 312, 22 pages | Harper Perennial | 2007 |
| Kitchen Confidential: Adventures in the Culinary Underbelly | Anthony Bourdain | 9780060899226 | 312, 22 pages | Harper Perennial | 2007 |
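
The duplicated row is consistent with a JSON object whose isbn field is an array: when the parsed list is converted to a data frame, the length-one fields are recycled to match the two ISBNs. The sketch below illustrates this, and one way to keep a single row per book, using an inline JSON string. The structure of that string is an assumption about kitchen_confidential.json, not a copy of the file.

```r
library(jsonlite)

# Hypothetical stand-in for kitchen_confidential.json; the field layout is assumed
json_src <- '{
  "title": "Kitchen Confidential: Adventures in the Culinary Underbelly",
  "author": "Anthony Bourdain",
  "isbn": ["0060899220", "9780060899226"],
  "pages": "312, 22 pages",
  "publisher": "Harper Perennial",
  "date": "2007"
}'

# fromJSON() returns a named list; isbn becomes a character vector of length 2
book <- fromJSON(json_src)

# Length-one fields are recycled against isbn, giving one row per ISBN
two_rows <- as.data.frame(book, stringsAsFactors = FALSE)

# Collapsing the ISBN vector first keeps one row per book
book$isbn <- paste(book$isbn, collapse = ", ")
one_row <- as.data.frame(book, stringsAsFactors = FALSE)

print(nrow(two_rows))
print(one_row$isbn)
```

Whether one row per ISBN or one row per book is preferable depends on how the table will be used; the two-row form is easier to filter by ISBN.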