Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats.
Instead of having a book with multiple authors, I chose an element (genres) that has multiple values.
load_html <- read_html(url("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_607/Assignment7/books.html"))
html_table <- load_html %>%
html_element("table") %>%
html_table()
html_table %>%
mutate(across(copies_sold_worldwide, ~ format(.x, big.mark = ","))) %>%
knitr::kable()
| title | author | release_year | genres | pages | copies_sold_worldwide |
|---|---|---|---|---|---|
| Harry Potter and the Half-Blood Prince | J.K Rowling | 2005 | Action, Fantasy | 652 | 65,000,000 |
| The Shining | Stephen King | 1977 | Horror, Mystery | 688 | 700,000 |
| House of Gucci: A True Story of Murder, Madness, Glamour, and Greed | Sara Gay Forden | 2004 | Drama, Crime, Biography | 288 | NA |
xml_data <- read_xml(url("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_607/Assignment7/books.xml"))
Looking at the structure of the XML to get the column names:
xml_structure(xml_data)
Extracting the data and combining into a data frame:
titles <- xml_data %>%
xml_find_all("//title") %>%
xml_text()
authors <- xml_data %>%
xml_find_all("//author") %>%
xml_text()
release_years <- xml_data %>%
xml_find_all("//release_year") %>%
xml_text()
genres <- xml_data %>%
xml_find_all("//genres") %>%
xml_text()
pages <- xml_data %>%
xml_find_all("//pages") %>%
xml_text()
copies_sold_worldwide <- xml_data %>%
xml_find_all("//copies_sold_worldwide") %>%
xml_text()
xml_table <- data.frame("title" = titles,
"author" = authors,
"release_year" = release_years,
"genres" = genres,
"pages" = pages,
"copies_sold_worldwide" = copies_sold_worldwide)
xml_table$release_year <- as.integer(xml_table$release_year)
xml_table$pages <- as.integer(xml_table$pages)
xml_table$copies_sold_worldwide <- as.integer(xml_table$copies_sold_worldwide)
xml_table %>%
mutate(across(copies_sold_worldwide, ~ format(.x, big.mark = ","))) %>%
knitr::kable()
| title | author | release_year | genres | pages | copies_sold_worldwide |
|---|---|---|---|---|---|
| Harry Potter and the Half-Blood Prince | J.K Rowling | 2005 | Action, Fantasy | 652 | 65,000,000 |
| The Shining | Stephen King | 1977 | Horror, Mystery | 688 | 700,000 |
| House of Gucci: A True Story of Murder, Madness, Glamour, and Greed | Sara Gay Forden | 2004 | Drama, Crime, Biography | 288 | NA |
json_table <- fromJSON(url("https://raw.githubusercontent.com/ShanaFarber/cuny-sps/master/DATA_607/Assignment7/books.json"))
json_table %>%
mutate(across(copies_sold_worldwide, ~ format(.x, big.mark = ","))) %>%
knitr::kable()
| title | author | release_year | genres | pages | copies_sold_worldwide |
|---|---|---|---|---|---|
| Harry Potter and the Half-Blood Prince | J.K. Rowling | 2005 | Action, Fantasy | 652 | 65,000,000 |
| The Shining | Stephen King | 1977 | Horror, Mystery | 688 | 700,000 |
| House of Gucci: A True Story of Murder, Madness, Glamour, and Greed | Sara Gay Forden | 2004 | Drama, Crime, Biography | 288 | NA |
The data from HTML and JSON read in exactly the same. For the XML, I needed to cast the numeric columns to integer type, as they read in as characters. Once I did that, the tables were identical.