607 Assignment 7 Dylan Gold

Approach

In this assignment we create our own HTML and JSON files by hand from information about books. Then we import both files into R and check whether the resulting data frames are identical. It seems simple enough to me. We need three books, at least one of which has multiple authors, plus some extra attributes.

For my books I will choose Sunrise Nights by Jeff Zentner and Brittany Cavallaro, The Lord of the Rings by J. R. R. Tolkien, and Neuromancer by William Gibson. For my extra attributes I will have publisher, publication date, and genre.

I can get the information on these from Wikipedia or Amazon.
The data for these books, in the order title, author(s), publisher, publication date, genre, is:
Sunrise Nights, Jeff Zentne; Brittany Cavallaro, Quill Tree Books, July 9, 2024, Romance
The Lord of the Rings, J. R. R. Tolkien, Allen & Unwin, July 29, 1954, Fantasy
Neuromancer, William Gibson, Ace Books, July 1, 1984, Science fiction

I will likely use dplyr's all_equal() or base R's identical() to compare them.
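As a rough sketch of how these comparison functions behave, here is a small base R example. The data frames `a` and `b` are made-up values for illustration, not the book data, and the example uses base R's all.equal() (with a dot), which is a different function from dplyr's all_equal().

```r
# Two toy data frames (hypothetical values, for illustration only)
a <- data.frame(title = c("A", "B"), pages = c(100, 200))
b <- data.frame(title = c("A", "B"), pages = c(100, 200 + 1e-10))

# identical() demands an exact match and always returns a single TRUE or FALSE
exact_match <- identical(a, b)

# all.equal() tolerates tiny numeric differences; it returns TRUE or a
# character vector describing the differences, so it is wrapped in isTRUE()
near_match <- isTRUE(all.equal(a, b))

# When the objects really differ, all.equal() reports what differs
b$pages[2] <- 250
report <- all.equal(a, b)

print(exact_match)
print(near_match)
print(report)
```

Because all.equal() can return text instead of FALSE, wrapping it in isTRUE() is the safer choice inside an if() condition.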

Codebase

First I will create my files by hand, starting with the JSON data.
I will follow the format from my approach. Each book is stored as one object in an overall list.

The following is a copy of the JSON I created:
[
  {
    "title": "Sunrise Nights",
    "author": ["Jeff Zentne", "Brittany Cavallaro"],
    "publisher": "Quill Tree Books",
    "publication_date": "July 9, 2024",
    "genre": "Romance"
  },
  {
    "title": "The Lord of the Rings",
    "author": ["J. R. R. Tolkien"],
    "publisher": "Allen & Unwin",
    "publication_date": "July 29, 1954",
    "genre": "Fantasy"
  },
  {
    "title": "Neuromancer",
    "author": ["William Gibson"],
    "publisher": "Ace Books",
    "publication_date": "July 1, 1984",
    "genre": "Science fiction"
  }
]

Now I will also create an HTML table for the data, following a similar format.
I was less familiar with writing raw HTML because most web development uses a framework to help. The key elements of an HTML table are <tr></tr> for the rows, <th></th> for the header cells, and <td></td> for the data cells.
I also had to go back for one more thing: <ul></ul> for a list and <li></li> for each list item, which I used for the authors.
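To show how those tags fit together, here is a minimal sketch that parses a one-row table with rvest. The inline `html_snippet` is a hypothetical stand-in for illustration and is not the actual books.html file.

```r
library(rvest)

# A one-row version of the table, written inline (hypothetical snippet,
# not the actual books.html file)
html_snippet <- '
<table>
  <tr><th>title</th><th>author</th></tr>
  <tr>
    <td>Sunrise Nights</td>
    <td><ul><li>Jeff Zentne</li><li>Brittany Cavallaro</li></ul></td>
  </tr>
</table>'

page <- minimal_html(html_snippet)

# html_table() returns one tibble per <table>; each cell becomes plain text
books <- html_table(page)[[1]]

# The individual <li> items are still available from the parsed document
authors <- html_text2(html_elements(page, "li"))

print(books)
print(authors)
```

The table cell comes back as a single string, while selecting the <li> elements directly keeps the authors as separate values.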

This is the table data:

title	author	publisher	publication_date	genre
Sunrise Nights	Jeff Zentne Brittany Cavallaro	Quill Tree Books	July 9, 2024	Romance
The Lord of the Rings	J. R. R. Tolkien	Allen & Unwin	July 29, 1954	Fantasy
Neuromancer	William Gibson	Ace Books	July 1, 1984	Science fiction

First I put them into my GitHub repository.

I also need some libraries to make this easier: rvest for the HTML and jsonlite for the JSON.

library(rvest)
library(jsonlite)
library(dplyr) # provides select() for the comparisons

Now we can bring them into R. To read the HTML I will use the rvest package.

library(rvest)
html_url <- "https://raw.githubusercontent.com/DylanGoldJ/607-Assignment-5/refs/heads/main/books.html"

html_books <- read_html(html_url) %>% html_table()
html_books <- html_books[[1]] # html_table returns a list of tibbles, we just have one we need to retrieve

Now the JSON file. I will use jsonlite for this.

json_url <- "https://raw.githubusercontent.com/DylanGoldJ/607-Assignment-5/refs/heads/main/books.json"

json_books <- fromJSON(json_url)
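To see what fromJSON() does with the author arrays without fetching the URL, here is a small sketch on an inline excerpt. The `json_text` string is a shortened, hand-typed copy of two of the books and is only an illustration.

```r
library(jsonlite)

# A two-book excerpt of the JSON, written inline (not read from the URL)
json_text <- '[
  {"title": "Sunrise Nights", "author": ["Jeff Zentne", "Brittany Cavallaro"]},
  {"title": "Neuromancer", "author": ["William Gibson"]}
]'

books <- fromJSON(json_text)

# The array of objects is simplified to a data frame with one row per book
print(class(books))

# The author arrays have different lengths, so they cannot form an atomic
# column; author becomes a list column of character vectors
print(class(books$author))
print(lengths(books$author))
```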

Now we can look at the tables that we created. html_books first needs to be converted from a tibble to a data frame so that it has the same class as json_books.

html_books <- as.data.frame(html_books)
html_books

                  title                                    author
1        Sunrise Nights Jeff Zentne\n        \tBrittany Cavallaro
2 The Lord of the Rings                          J. R. R. Tolkien
3           Neuromancer                            William Gibson
         publisher publication_date           genre
1 Quill Tree Books     July 9, 2024         Romance
2    Allen & Unwin    July 29, 1954         Fantasy
3        Ace Books     July 1, 1984 Science fiction

json_books

                  title                          author        publisher
1        Sunrise Nights Jeff Zentne, Brittany Cavallaro Quill Tree Books
2 The Lord of the Rings                J. R. R. Tolkien    Allen & Unwin
3           Neuromancer                  William Gibson        Ace Books
  publication_date           genre
1     July 9, 2024         Romance
2    July 29, 1954         Fantasy
3     July 1, 1984 Science fiction

We can see that the data is the same other than the formatting of the authors. In the JSON, the authors were read as a list of character vectors, while in the HTML file the list was combined into a single string. I will try to compare them as they are now with all.equal(). This function returns TRUE when the objects match, and otherwise a character vector describing the differences.
As the output below shows, because the author column was created differently in each file, the two columns are not even the same type, so the data frames are not completely equal.

all.equal(html_books, json_books)

[1] "Component \"author\": Modes: character, list"              
[2] "Component \"author\": target is character, current is list"

We can try to compare them without the author column to see how all.equal() will treat the rest.

html_books_removed_author <- select(html_books, -"author")
json_books_removed_author <- select(json_books, -"author")
all.equal(html_books_removed_author, json_books_removed_author)

[1] TRUE

We can see that other than this column they are equal. The difference comes from the way I created these files. If I had stored the authors as a single string such as “Jeff Zentne, Brittany Cavallaro” in both files, rather than as separate list items, this would not have been a problem.
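One possible way to reconcile the two author columns after import, rather than changing the files, is to convert both to the same comma-separated string. This is only a sketch in base R: the vectors `html_author` and `json_author` are retyped by hand to mimic the shapes printed above, and the choice of ", " as the separator is my assumption.

```r
# Author values shaped like the two imports (retyped by hand for this sketch)
html_author <- c("Jeff Zentne\n        \tBrittany Cavallaro",
                 "J. R. R. Tolkien",
                 "William Gibson")
json_author <- list(c("Jeff Zentne", "Brittany Cavallaro"),
                    "J. R. R. Tolkien",
                    "William Gibson")

# HTML side: treat a line break plus surrounding whitespace as the separator
html_clean <- gsub("\\s*\\n\\s*", ", ", html_author, perl = TRUE)

# JSON side: collapse each character vector into one comma-separated string
json_clean <- vapply(json_author, paste, character(1), collapse = ", ")

print(identical(html_clean, json_clean))
```

The same two transformations could be applied to html_books$author and json_books$author before calling all.equal() on the full data frames.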

Conclusion

In conclusion, this assignment helped me understand the formatting behind JSON and HTML files. I was able to create a data frame from both formats using libraries like rvest and jsonlite. This has future applications, such as reading JSON from APIs or web scraping HTML. The data frames were nearly identical, but because the JSON and HTML imports treat lists differently they were not exactly the same. Interestingly, jsonlite turned the whole author column into a list column, while rvest's html_table() flattened the HTML list into a single string. Some additions I could make to this assignment are pulling data straight from a site using an API or web scraping, and then comparing that data or just examining it.