Assignment prompt:

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

library(jsonlite)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(XML)
library(rvest)
## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding
library(RCurl)
## 
## Attaching package: 'RCurl'
## 
## The following object is masked from 'package:tidyr':
## 
##     complete
books_json <- fromJSON("https://raw.githubusercontent.com/Shayaeng/Data607/main/books.json")
books_xml_object <- xmlParse(getURL("https://raw.githubusercontent.com/Shayaeng/Data607/main/books.xml"))
books_xml <- xmlToDataFrame(books_xml_object)
books_html_raw <- read_html("https://raw.githubusercontent.com/Shayaeng/Data607/main/books.html")

table_data <- html_nodes(books_html_raw, "table") %>%
                html_table()
books_html <- as.data.frame(table_data)
(books_json)
##                  title                    series
## 1  The Gathering Storm         The Wheel of Time
## 2     The Way of Kings   The Stormlight Archives
## 3 The Name of the Wind The Kingkiller Chronicles
##                            authors goodreads release_year   genre
## 1 Robert Jordan, Brandon Sanderson       4.4         2009 fantasy
## 2                Brandon Sanderson       4.6         2010 fantasy
## 3                 Patrick Rothfuss       4.5         2007 fantasy
(books_xml)
##                  title                    series                        authors
## 1  The Gathering Storm         The Wheel of Time Robert JordanBrandon Sanderson
## 2     The Way of Kings   The Stormlight Archives              Brandon Sanderson
## 3 The Name of the Wind The Kingkiller Chronicles               Patrick Rothfuss
##   goodreads release_year   genre
## 1       4.4         2009 fantasy
## 2       4.6         2010 fantasy
## 3       4.5         2007 fantasy
(books_html)
##                     X1                        X2
## 1                title                    series
## 2  The Gathering Storm         The Wheel of Time
## 3     The Way of Kings   The Stormlight Archives
## 4 The Name of the Wind The Kingkiller Chronicles
##                                                 X3        X4           X5
## 1                                          authors goodreads release_year
## 2 Robert Jordan\n                Brandon Sanderson       4.4         2009
## 3                                Brandon Sanderson       4.6         2010
## 4                                 Patrick Rothfuss       4.5         2007
##        X6
## 1   genre
## 2 fantasy
## 3 fantasy
## 4 fantasy

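Printing the three data frames shows that they are not identical. The HTML table was read with its header row as the first row of data, leaving the default column names X1 to X6, and the multi-author cell is formatted differently in each source: "Robert Jordan, Brandon Sanderson" in JSON, "Robert JordanBrandon Sanderson" in XML, and the two names separated by a newline and a run of spaces in HTML. To make them identical, I will unnest the authors list in the JSON data frame, promote the first row of the HTML data frame to column names, and give the authors column the same format in all three.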
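The adjustments described above can be sketched in base R on toy data. This is a minimal illustration, not the code used in this assignment: `json_toy` and `html_toy` are hypothetical stand-ins that reuse two titles and their authors from the printed tables, and the sketch assumes the JSON authors column is a list of character vectors.

```r
# Toy stand-ins for two of the loaded objects (shapes are assumptions).

# 1. JSON: authors arrive as a list column, one character vector per book.
json_toy <- data.frame(title = c("The Gathering Storm", "The Way of Kings"),
                       stringsAsFactors = FALSE)
json_toy$authors <- list(c("Robert Jordan", "Brandon Sanderson"),
                         "Brandon Sanderson")
# Collapse each vector into one comma-separated string.
json_toy$authors <- vapply(json_toy$authors, paste, character(1),
                           collapse = ", ")

# 2. HTML: the header row was read as data, so promote it to column names.
html_toy <- data.frame(
  X1 = c("title", "The Gathering Storm", "The Way of Kings"),
  X2 = c("authors",
         "Robert Jordan\n                Brandon Sanderson",
         "Brandon Sanderson"),
  stringsAsFactors = FALSE
)
colnames(html_toy) <- unlist(html_toy[1, ], use.names = FALSE)
html_toy <- html_toy[-1, ]
rownames(html_toy) <- NULL  # restart row numbering at 1

# 3. Replace any run of two or more whitespace characters with ", ".
html_toy$authors <- gsub("\\s{2,}", ", ", html_toy$authors)

# Compare the two cleaned toy data frames.
identical(json_toy, html_toy)
```

Collapsing with `vapply()` avoids the unnest, group, and summarise round trip, at the cost of assuming every element of the list column is a plain character vector.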

#unnest list and combine json rows
books_json_unnested <- books_json %>%
  unnest(authors)
books_json_unnested <- books_json_unnested %>%
  rename(authors = author)

books_json_unnested <- books_json_unnested %>%
  group_by(title, series, goodreads, release_year, genre) %>%
  summarise(authors = paste(authors, collapse = ", ")) %>%
  ungroup()
## `summarise()` has grouped output by 'title', 'series', 'goodreads',
## 'release_year'. You can override using the `.groups` argument.
#reorder the columns
books_json_unnested <- books_json_unnested[, c(1,2,6,3,4,5)]

#rename columns and get rid of extra row in html
colnames(books_html) <- books_html[1, ]
books_html <- books_html[-1, ]

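# Fix the multi-author cell (row 1, column 3) in the XML and HTML data frames:
# XML: insert the missing ", " after "Jordan"
# HTML: replace the newline and indentation between the two names with ", "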
books_xml[1, 3] <- gsub("Jordan", "Jordan, ", books_xml[1, 3])
books_html[1, 3] <- gsub("\\s{2,}", ", ", books_html[1, 3])

# Order books_xml
books_xml <- books_xml[order(books_xml$goodreads, decreasing = TRUE), ]

# Order books_html
books_html <- books_html[order(books_html$goodreads, decreasing = TRUE), ]

# Order books_json
books_json_unnested <- books_json_unnested[order(books_json_unnested$goodreads, decreasing = TRUE), ]
reset_row_names <- function(df) {
  rownames(df) <- NULL  # Remove existing row names
  rownames(df) <- seq_len(nrow(df))  # Assign new row names as a numeric sequence
  return(df)
}

# Applying the function to each data frame
books_xml <- reset_row_names(books_xml)
books_html <- reset_row_names(books_html)
books_json_unnested <- reset_row_names(books_json_unnested)
## Warning: Setting row names on a tibble is deprecated.
# Convert the JSON tibble to a plain data frame (overwrites the original books_json)
books_json <- as.data.frame(books_json_unnested)
# Check if all three data frames are identical
all_identical <- identical(books_xml, books_html) && identical(books_html, books_json)

# Display the result
print(paste("All three data frames are identical:", all_identical))
## [1] "All three data frames are identical: TRUE"

The three tables are now identical.
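When a check like this returns FALSE, `identical()` alone does not say why. The sketch below shows one way to inspect the difference, using a hypothetical helper `explain_difference()` and toy data that are not part of the original analysis. It assumes a common cause: the same values stored as text by one parser and as numbers by another.

```r
# Hypothetical helper: report how two data frames differ instead of
# returning only TRUE or FALSE.
explain_difference <- function(a, b) {
  list(
    same_names = identical(names(a), names(b)),
    class_a    = vapply(a, function(col) class(col)[1], character(1)),
    class_b    = vapply(b, function(col) class(col)[1], character(1)),
    all_equal  = all.equal(a, b)  # TRUE, or a description of the differences
  )
}

# Toy example: the same ratings stored as text and as numbers.
as_text   <- data.frame(goodreads = c("4.6", "4.5"), stringsAsFactors = FALSE)
as_number <- data.frame(goodreads = c(4.6, 4.5))

result <- explain_difference(as_text, as_number)
result$class_a   # column types in the first data frame
result$class_b   # column types in the second data frame
identical(as_text, as_number)
```

Comparing column classes first is usually the quickest way to find a mismatch, because `identical()` treats a character "4.6" and a numeric 4.6 as different even though they print the same.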