library(rvest)
library(jsonlite)
library(dplyr) # Helps with select() for the comparisons
607 Assignment 7 Dylan Gold
Approach
In this assignment we create our own HTML and JSON files by hand from information about books, import both into R, and check whether the resulting data frames are identical. We need three books, at least one of which has multiple authors, plus a few extra attributes.
For my books I will choose Sunrise Nights by Jeff Zentner and Brittany Cavallaro, The Lord of the Rings by J. R. R. Tolkien and Neuromancer by William Gibson. For my extra attributes I will use publisher, publication date and genre.
I can get this information from Wikipedia or Amazon.
The data for these books, in the order title, author(s), publisher, publication date, genre, is:
Sunrise Nights, Jeff Zentne; Brittany Cavallaro, Quill Tree Books, July 9, 2024, Romance
The Lord of the Rings, J. R. R. Tolkien, Allen & Unwin, July 29, 1954, Fantasy
Neuromancer, William Gibson, Ace Books, July 1, 1984, Science fiction
I will likely use dplyr's all_equal() or base R's all.equal() and identical() to compare them.
Codebase
First I will create my files by hand, starting with the JSON data.
I will follow the format from my approach. Each book is stored as an object in one top-level array.
The following is a copy of the JSON I created:
[
  {
    "title": "Sunrise Nights",
    "author": ["Jeff Zentne", "Brittany Cavallaro"],
    "publisher": "Quill Tree Books",
    "publication_date": "July 9, 2024",
    "genre": "Romance"
  },
  {
    "title": "The Lord of the Rings",
    "author": ["J. R. R. Tolkien"],
    "publisher": "Allen & Unwin",
    "publication_date": "July 29, 1954",
    "genre": "Fantasy"
  },
  {
    "title": "Neuromancer",
    "author": ["William Gibson"],
    "publisher": "Ace Books",
    "publication_date": "July 1, 1984",
    "genre": "Science fiction"
  }
]
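As a quick check of this structure, the same kind of JSON can be parsed from an inline string instead of a file. This is only a minimal sketch: the names json_text and parsed are illustrative and are not part of the assignment files, and only two of the three records are included.

```r
library(jsonlite)

# Inline copy of two of the records (illustrative; the assignment reads a file)
json_text <- '[
  {"title": "Sunrise Nights",
   "author": ["Jeff Zentne", "Brittany Cavallaro"],
   "publisher": "Quill Tree Books",
   "publication_date": "July 9, 2024",
   "genre": "Romance"},
  {"title": "Neuromancer",
   "author": ["William Gibson"],
   "publisher": "Ace Books",
   "publication_date": "July 1, 1984",
   "genre": "Science fiction"}
]'

parsed <- fromJSON(json_text)

# The author arrays have different lengths, so they cannot be simplified
# into a plain character vector; jsonlite keeps them as a list column
str(parsed$author)
lengths(parsed$author)
```

Storing every author field as an array, even for a single author, keeps the JSON consistent, but it also means the column arrives in R as a list rather than as plain strings.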
Now I will also create an HTML table for the data. The data will follow a similar format.
I was less familiar with writing raw HTML because most web development uses some framework to help. The key elements of an HTML table are <tr></tr> for the rows, <th></th> for the header cells and <td></td> for the data cells.
I had to go back for one more thing: <ul></ul> for a list and <li></li> for each list item, to hold the multiple authors.
| title | author | publisher | publication_date | genre |
|---|---|---|---|---|
| Sunrise Nights | Jeff Zentne; Brittany Cavallaro | Quill Tree Books | July 9, 2024 | Romance |
| The Lord of the Rings | J. R. R. Tolkien | Allen & Unwin | July 29, 1954 | Fantasy |
| Neuromancer | William Gibson | Ace Books | July 1, 1984 | Science fiction |
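The table above is built from markup along the following lines. This is a hypothetical reconstruction of the header and one row, not a copy of books.html, so the real file may differ in layout and whitespace. The sketch parses the markup from a string to show what rvest does with a list inside a cell; html_text and sketch are illustrative names.

```r
library(rvest)

# Hypothetical reconstruction of the table markup (header plus one row);
# the real books.html may differ in layout and whitespace
html_text <- '<table>
  <tr><th>title</th><th>author</th><th>publisher</th></tr>
  <tr>
    <td>Sunrise Nights</td>
    <td>
      <ul>
        <li>Jeff Zentne</li>
        <li>Brittany Cavallaro</li>
      </ul>
    </td>
    <td>Quill Tree Books</td>
  </tr>
</table>'

# html_table() returns a list with one table per <table> element
sketch <- html_table(read_html(html_text))[[1]]

# The <ul> is not kept as a list: the cell comes back as one character value
class(sketch$author)
sketch$author
```

The exact whitespace between the two names depends on how the markup is indented, which is why no expected value is shown for the cell.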
First I put them into my GitHub repository.
I also need some libraries to make this easier: rvest for the HTML and jsonlite for the JSON.
Now we can bring them into R. To read the HTML I will use the rvest package.
library(rvest)
html_url <- "https://raw.githubusercontent.com/DylanGoldJ/607-Assignment-5/refs/heads/main/books.html"
html_books <- read_html(html_url) %>% html_table()
html_books <- html_books[[1]] # html_table() returns a list of tibbles; we only have one, so we retrieve it
Now the JSON file. I will use jsonlite for this.
json_url <- "https://raw.githubusercontent.com/DylanGoldJ/607-Assignment-5/refs/heads/main/books.json"
json_books <- fromJSON(json_url)
Now we can look at the tables that we created. html_books first needs to be converted from a tibble to a data frame.
html_books <- as.data.frame(html_books)
html_books
                  title                             author
1        Sunrise Nights Jeff Zentne\n \tBrittany Cavallaro
2 The Lord of the Rings                   J. R. R. Tolkien
3           Neuromancer                     William Gibson
         publisher publication_date           genre
1 Quill Tree Books     July 9, 2024         Romance
2    Allen & Unwin    July 29, 1954         Fantasy
3        Ace Books     July 1, 1984 Science fiction
json_books
                  title                          author        publisher
1        Sunrise Nights Jeff Zentne, Brittany Cavallaro Quill Tree Books
2 The Lord of the Rings                J. R. R. Tolkien    Allen & Unwin
3           Neuromancer                  William Gibson        Ace Books
  publication_date           genre
1     July 9, 2024         Romance
2    July 29, 1954         Fantasy
3     July 1, 1984 Science fiction
We can see that the data is the same apart from the formatting of the authors. In the JSON the authors were read as a list of character vectors, while in the HTML file the list was combined into a single string. I will try to compare them as they are now with all.equal(). This function returns TRUE when the objects are equal, and otherwise a character vector describing the differences.
The comparison reports differences: because the author column was created differently in each file, the two columns are not even the same type. The data frames are not completely equal.
all.equal(html_books, json_books)
[1] "Component \"author\": Modes: character, list"
[2] "Component \"author\": target is character, current is list"
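The mismatch reported above can be reproduced on a small scale with two hand-built data frames that differ only in how the author column is stored. This is a sketch with made-up names (as_string, as_list, result), not the assignment data itself.

```r
# Two hand-built frames that differ only in how "author" is stored
as_string <- data.frame(title  = "Sunrise Nights",
                        author = "Jeff Zentne, Brittany Cavallaro")

as_list <- data.frame(title = "Sunrise Nights")
as_list$author <- list(c("Jeff Zentne", "Brittany Cavallaro"))  # list column

result <- all.equal(as_string, as_list)

# On a mismatch all.equal() returns messages, not FALSE,
# so wrap it in isTRUE() when a single logical is needed
is.character(result)
isTRUE(result)

# identical() always returns a single TRUE or FALSE
identical(as_string, as_list)
```

all.equal() is the more useful of the two while investigating, because its messages say which component differs and why.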
We can compare them again without the author column to see how all.equal() treats the remaining columns.
html_books_removed_author <- select(html_books, -"author")
json_books_removed_author <- select(json_books, -"author")
all.equal(html_books_removed_author, json_books_removed_author)
[1] TRUE
We can see that, other than this column, they are equal. The difference comes from the way I created the files. If I had stored the authors in both files as a single string such as "Jeff Zentne, Brittany Cavallaro" rather than as separate list items, this would not have been a problem.
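Another option, instead of changing the files, would be to bring both author columns to the same string form after import. The sketch below uses small stand-in frames (html_like and json_like are hypothetical names) and assumes the HTML cell separates authors with a whitespace run that contains a newline, as in the printed output above. The real files may need a different pattern.

```r
# Stand-ins for the two imported frames (only the columns needed here)
html_like <- data.frame(
  title  = c("Sunrise Nights", "Neuromancer"),
  author = c("Jeff Zentne\n \tBrittany Cavallaro", "William Gibson")
)
json_like <- data.frame(title = c("Sunrise Nights", "Neuromancer"))
json_like$author <- list(c("Jeff Zentne", "Brittany Cavallaro"),
                         "William Gibson")

# JSON side: collapse each character vector into one comma-separated string
json_like$author <- vapply(json_like$author, paste, character(1),
                           collapse = ", ")

# HTML side: replace the whitespace run around the newline with the same separator
html_like$author <- gsub("\\s*\\n\\s*", ", ", html_like$author, perl = TRUE)

all.equal(html_like, json_like)
```

Collapsing to a string loses the list structure, so this only suits a comparison like this one. Keeping the list column would be preferable if the authors need to be processed individually later.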
Conclusion
In conclusion, this assignment helped me understand the formatting behind JSON and HTML files. I was able to create a data frame from both formats using the rvest and jsonlite libraries. This has future applications, since JSON is used by all sorts of APIs and HTML parsing is the basis of web scraping. The two data frames were nearly identical, but because the JSON and HTML imports treat lists differently they were not exactly the same. Interestingly, the JSON import turned the whole author column into a list column, while the HTML import flattened the list into a single string. Possible extensions to this assignment are pulling data straight from a site using an API or web scraping, then comparing or simply examining the data from there.