Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats.

Instead of having a book with multiple authors, I chose an element (genres) that has multiple values.

HTML

load_html <- read_html(url("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_607/Assignment7/books.html"))

html_table <- load_html %>%
  html_element("table") %>%
  html_table()

html_table %>% 
  mutate(across(copies_sold_worldwide, ~ format(.x, big.mark = ","))) %>% 
  knitr::kable()
title author release_year genres pages copies_sold_worldwide
Harry Potter and the Half-Blood Prince J.K Rowling 2005 Action, Fantasy 652 65,000,000
The Shining Stephen King 1977 Horror, Mystery 688 700,000
House of Gucci: A True Story of Murder, Madness, Glamour, and Greed Sara Gay Forden 2004 Drama, Crime, Biography 288 NA

XML

xml_data <- read_xml(url("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_607/Assignment7/books.xml"))

Looking at the structure of the XML to get the column names:

xml_structure(xml_data)

Extracting the data and combining into a data frame:

titles <- xml_data %>%
  xml_find_all("//title") %>%
  xml_text()

authors <- xml_data %>%
  xml_find_all("//author") %>%
  xml_text()

release_years <- xml_data %>%
  xml_find_all("//release_year") %>%
  xml_text()

genres <- xml_data %>%
  xml_find_all("//genres") %>%
  xml_text()

pages <- xml_data %>%
  xml_find_all("//pages") %>%
  xml_text()

copies_sold_worldwide <- xml_data %>%
  xml_find_all("//copies_sold_worldwide") %>%
  xml_text()

xml_table <- data.frame("title" = titles,
                        "author" = authors,
                        "release_year" = release_years,
                        "genres" = genres,
                        "pages" = pages,
                        "copies_sold_worldwide" = copies_sold_worldwide)

xml_table$release_year <- as.integer(xml_table$release_year)
xml_table$pages <- as.integer(xml_table$pages)
xml_table$copies_sold_worldwide <- as.integer(xml_table$copies_sold_worldwide)

xml_table %>% 
  mutate(across(copies_sold_worldwide, ~ format(.x, big.mark = ","))) %>% 
  knitr::kable()
title author release_year genres pages copies_sold_worldwide
Harry Potter and the Half-Blood Prince J.K Rowling 2005 Action, Fantasy 652 65,000,000
The Shining Stephen King 1977 Horror, Mystery 688 700,000
House of Gucci: A True Story of Murder, Madness, Glamour, and Greed Sara Gay Forden 2004 Drama, Crime, Biography 288 NA

JSON

json_table <- fromJSON(url("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_607/Assignment7/books.json"))

json_table %>% 
  mutate(across(copies_sold_worldwide, ~ format(.x, big.mark = ","))) %>% 
  knitr::kable()
title author release_year genres pages copies_sold_worldwide
Harry Potter and the Half-Blood Prince J.K. Rowling 2005 Action, Fantasy 652 65,000,000
The Shining Stephen King 1977 Horror, Mystery 688 700,000
House of Gucci: A True Story of Murder, Madness, Glamour, and Greed Sara Gay Forden 2004 Drama, Crime, Biography 288 NA

The data from HTML and JSON read in exactly the same. For the XML, I needed to cast the numeric columns to integer type, as they read in as characters. Once I did that, the tables were identical.