This assignment reads three files, each in a different format (HTML, XML, and JSON), into an R Markdown document. The files were written by hand for this exercise, although the same information probably already exists on the web in raw form.
The chosen books are below:
library(tibble)
library(knitr)       # kable()
library(kableExtra)  # kable_styling() and the %>% pipe

chosen_books <- tibble(
  Title = c("Take the Cannoli: Stories from the New World", "Gig: Americans Talk about Their Jobs", "Kitchen Confidential: Adventures in the Culinary Underbelly"),
  Author = c("Sarah Vowell", "John Bowe, Marisa Bowe & Sabin Streeter", "Anthony Bourdain"),
  Format = c("HTML", "XML", "JSON")
)
kable(chosen_books) %>%
  kable_styling(latex_options = "scale_down")
| Title | Author | Format |
|---|---|---|
| Take the Cannoli: Stories from the New World | Sarah Vowell | HTML |
| Gig: Americans Talk about Their Jobs | John Bowe, Marisa Bowe & Sabin Streeter | XML |
| Kitchen Confidential: Adventures in the Culinary Underbelly | Anthony Bourdain | JSON |
Each book will have the following attributes: title, author(s), ISBN, pages, publisher, and date.
This section uses the textreadr package to parse the HTML.
# Read the HTML from Github repository
url_html <- "https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/week7/take_the_cannoli.html"
books_html <- read_html(url_html)
# Use the CSS selector for the element ID ("#books") and pipe to the html_table function
book_html <- books_html %>%
html_elements("#books") %>%
html_table()
kable(book_html) %>%
kable_styling(latex_options = "scale_down")
# Detach this package; it conflicts with other XML readers
detach("package:textreadr", unload=TRUE)
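The selection step in the HTML section can be tried without network access. The following is a minimal sketch, assuming the rvest package is installed and that the page holds a table with `id="books"`; the inline HTML is a hypothetical stand-in, not the contents of take_the_cannoli.html. It also shows that `html_table()` applied to the result of `html_elements()` returns a list of tables, so the first element has to be extracted to get a single data frame.

```r
library(rvest)

# Hypothetical stand-in for the real page; only the id="books" table matters here.
html_sample <- '<html><body>
<table id="books">
<tr><th>title</th><th>author</th></tr>
<tr><td>Take the Cannoli: Stories from the New World</td><td>Sarah Vowell</td></tr>
</table>
</body></html>'

page <- read_html(html_sample)

# "#books" is a CSS selector matching the element whose id is "books".
# html_elements() returns a node set, so html_table() returns a list of tibbles.
tables <- html_table(html_elements(page, "#books"))

# Take the first (and only) table as a data frame.
book_tbl <- tables[[1]]
```

Using `html_element()` (singular) in place of `html_elements()` would return the single table directly instead of a one-element list.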
Using the XML and xml2 packages, we can parse XML documents. In this case the book has multiple authors, and xmlToDataFrame() runs their names together in a single authors cell.
library(XML)
library(xml2)
book_url <- 'https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/week7/gig.xml'
data <- read_xml(book_url)
doc <- xmlParse(data)
df <- xmlToDataFrame(nodes = getNodeSet(doc, "//book"))
kable(df) %>%
kable_styling(latex_options = "scale_down")
| title | authors | isbn | pages | publisher | date |
|---|---|---|---|---|---|
| Gig: Americans Talk about Their Jobs | Marisa BoweJohn BoweSabin Streeter | 0609807072 | 672 pages | Three Rivers Press | 2001 |
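One way to keep the authors readable is to extract each author node separately and join the names with an explicit separator. This is a minimal sketch using xml2 only, assuming each name sits in its own `<author>` element under `<authors>`; the inline XML is a hypothetical stand-in, and the element names in gig.xml may differ.

```r
library(xml2)

# Hypothetical stand-in for gig.xml; the <author> element name is an assumption.
xml_sample <- '<books>
  <book>
    <title>Gig: Americans Talk about Their Jobs</title>
    <authors>
      <author>Marisa Bowe</author>
      <author>John Bowe</author>
      <author>Sabin Streeter</author>
    </authors>
  </book>
</books>'

doc_sample <- read_xml(xml_sample)
book_node <- xml_find_first(doc_sample, "//book")

# Read each <author> as its own string, then join with a visible separator.
author_names <- xml_text(xml_find_all(book_node, ".//author"))

book_row <- data.frame(
  title = xml_text(xml_find_first(book_node, "./title")),
  authors = paste(author_names, collapse = "; "),
  stringsAsFactors = FALSE
)
```

The separator is a design choice: "; " avoids ambiguity with commas that can appear inside names.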
We parse a JSON document containing multiple ISBNs.
Note: because the book has two ISBNs, the resulting data frame has two rows.
book_json <- as_tibble(jsonlite::fromJSON("https://raw.githubusercontent.com/cliftonleesps/607_acq_mgt/main/week7/kitchen_confidential.json"))
kable(book_json) %>%
kable_styling(latex_options = "scale_down")
| title | author | isbn | pages | publisher | date |
|---|---|---|---|---|---|
| Kitchen Confidential: Adventures in the Culinary Underbelly | Anthony Bourdain | 0060899220 | 312, 22 pages | Harper Perennial | 2007 |
| Kitchen Confidential: Adventures in the Culinary Underbelly | Anthony Bourdain | 9780060899226 | 312, 22 pages | Harper Perennial | 2007 |
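The two rows come from how a list with unequal element lengths is turned into a tibble. The sketch below assumes the JSON is a single object whose `isbn` field is an array of two strings; the inline JSON is a hypothetical stand-in, not the contents of kitchen_confidential.json. `fromJSON()` returns a named list, and `as_tibble()` recycles the length-one fields to match the two ISBNs.

```r
library(jsonlite)
library(tibble)

# Hypothetical JSON with the assumed shape: scalar fields plus an array of two ISBNs.
json_sample <- '{
  "title": "Kitchen Confidential: Adventures in the Culinary Underbelly",
  "author": "Anthony Bourdain",
  "isbn": ["0060899220", "9780060899226"]
}'

parsed <- fromJSON(json_sample)  # named list; isbn is a character vector of length 2
book_rows <- as_tibble(parsed)   # length-1 fields are recycled, giving two rows

# Alternative: collapse the ISBNs into one string to keep a single row per book.
parsed_one <- modifyList(parsed, list(isbn = paste(parsed$isbn, collapse = ", ")))
one_row <- as_tibble(parsed_one)
```

Which form is preferable depends on the analysis: one row per ISBN suits joins on ISBN, while one row per book suits a catalogue listing.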