DATA 607 - Assignment 8

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats.

Instead of having a book with multiple authors, I chose an element (genres) that has multiple values.

HTML

load_html <- read_html(url("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_607/Assignment7/books.html"))

html_table <- load_html %>%
  html_element("table") %>%
  html_table()

html_table %>% 
  mutate(across(copies_sold_worldwide, ~ format(.x, big.mark = ","))) %>% 
  knitr::kable()

title	author	release_year	genres	pages	copies_sold_worldwide
Harry Potter and the Half-Blood Prince	J.K Rowling	2005	Action, Fantasy	652	65,000,000
The Shining	Stephen King	1977	Horror, Mystery	688	700,000
House of Gucci: A True Story of Murder, Madness, Glamour, and Greed	Sara Gay Forden	2004	Drama, Crime, Biography	288	NA

XML

xml_data <- read_xml(url("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_607/Assignment7/books.xml"))

Looking at the structure of the XML to get the column names:

xml_structure(xml_data)

Extracting the data and combining into a data frame:

titles <- xml_data %>%
  xml_find_all("//title") %>%
  xml_text()

authors <- xml_data %>%
  xml_find_all("//author") %>%
  xml_text()

release_years <- xml_data %>%
  xml_find_all("//release_year") %>%
  xml_text()

genres <- xml_data %>%
  xml_find_all("//genres") %>%
  xml_text()

pages <- xml_data %>%
  xml_find_all("//pages") %>%
  xml_text()

copies_sold_worldwide <- xml_data %>%
  xml_find_all("//copies_sold_worldwide") %>%
  xml_text()

xml_table <- data.frame("title" = titles,
                        "author" = authors,
                        "release_year" = release_years,
                        "genres" = genres,
                        "pages" = pages,
                        "copies_sold_worldwide" = copies_sold_worldwide)

xml_table$release_year <- as.integer(xml_table$release_year)
xml_table$pages <- as.integer(xml_table$pages)
xml_table$copies_sold_worldwide <- as.integer(xml_table$copies_sold_worldwide)

xml_table %>% 
  mutate(across(copies_sold_worldwide, ~ format(.x, big.mark = ","))) %>% 
  knitr::kable()

title	author	release_year	genres	pages	copies_sold_worldwide
Harry Potter and the Half-Blood Prince	J.K Rowling	2005	Action, Fantasy	652	65,000,000
The Shining	Stephen King	1977	Horror, Mystery	688	700,000
House of Gucci: A True Story of Murder, Madness, Glamour, and Greed	Sara Gay Forden	2004	Drama, Crime, Biography	288	NA

JSON

json_table <- fromJSON(url("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_607/Assignment7/books.json"))

json_table %>% 
  mutate(across(copies_sold_worldwide, ~ format(.x, big.mark = ","))) %>% 
  knitr::kable()

title	author	release_year	genres	pages	copies_sold_worldwide
Harry Potter and the Half-Blood Prince	J.K. Rowling	2005	Action, Fantasy	652	65,000,000
The Shining	Stephen King	1977	Horror, Mystery	688	700,000
House of Gucci: A True Story of Murder, Madness, Glamour, and Greed	Sara Gay Forden	2004	Drama, Crime, Biography	288	NA

The data from HTML and JSON read in exactly the same. For the XML, I needed to cast the numeric columns to integer type, as they read in as characters. Once I did that, the tables were identical.

DATA 607 - Assignment 8

Shoshana Farber

March 12, 2023

HTML

XML

JSON