For the HTML file, I used the rvest library and the read_html function to fetch the page from the source endpoint. Once I had the HTML content, I selected the table within it and parsed the table's content into the books_html_df data frame.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 4.0.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rvest)
Attaching package: 'rvest'
The following object is masked from 'package:readr':
guess_encoding
htmlsource <- "https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%207/books.html"
html_content <- read_html(htmlsource)
# Select the table and convert it
# 'header = TRUE' ensures the <th> tags become your column names
books_html_df <- html_content %>%
  html_element("table") %>%
  html_table(header = TRUE)
print(books_html_df)
# A tibble: 3 × 6
Title Authors Year Publisher Language `Page Count`
<chr> <chr> <int> <chr> <chr> <int>
1 In Pursuit of the Perfect Portf… Andrew… 2023 Princeto… English 416
2 Market Wizards: Interviews with… Jack D… 2016 Wiley English 480
3 The Super Traders Alan R… 1995 Probus P… English 259
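The same parsing steps can be tried without network access. This is a minimal sketch, assuming the rvest package is installed; the inline HTML string and the names html_text and books_html_demo are made up for illustration and only mirror part of the table shown above, not the real books.html file.

```r
library(rvest)

# Hypothetical inline copy of two rows and four columns of the table
html_text <- '
<table>
  <tr><th>Title</th><th>Authors</th><th>Year</th><th>Page Count</th></tr>
  <tr><td>In Pursuit of the Perfect Portfolio</td><td>Andrew W. Lo, Stephen R. Foerster</td><td>2023</td><td>416</td></tr>
  <tr><td>The Super Traders</td><td>Alan Rubenfeld</td><td>1995</td><td>259</td></tr>
</table>'

# minimal_html() wraps the fragment in a complete HTML document
page <- minimal_html(html_text)

# Same steps as above: pick the first <table>, use the <th> row as column names
books_html_demo <- page %>%
  html_element("table") %>%
  html_table(header = TRUE)

print(names(books_html_demo))
print(books_html_demo$Authors)
```

Note that the comma-separated authors stay together as one string per row, because each table cell is read as a single value.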
JSON File
For the .json file, I used the jsonlite library and the fromJSON function to grab the contents of the .json file from my source endpoint and parse it into the books_json_df data frame.
library(jsonlite)
Attaching package: 'jsonlite'
The following object is masked from 'package:purrr':
flatten
title authors
1 In Pursuit of the Perfect Portfolio Andrew W. Lo, Stephen R. Foerster
2 Market Wizards: Interviews with Top Traders Jack D. Schwager
3 The Super Traders Alan Rubenfeld
year publisher language page_count
1 2023 Princeton University Press English 416
2 2016 Wiley English 480
3 1995 Probus Professional Pub. English 259
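To illustrate how fromJSON produces a data frame like the one above, here is a minimal sketch that parses an inline JSON string instead of the source endpoint. It assumes the jsonlite package is installed; the JSON text and the names json_text and books_json_demo are hypothetical stand-ins that copy only a few fields from the output shown above.

```r
library(jsonlite)

# Hypothetical inline JSON: each book is an object, authors is an array
json_text <- '[
  {"title": "In Pursuit of the Perfect Portfolio",
   "authors": ["Andrew W. Lo", "Stephen R. Foerster"],
   "year": 2023, "page_count": 416},
  {"title": "The Super Traders",
   "authors": ["Alan Rubenfeld"],
   "year": 1995, "page_count": 259}
]'

# fromJSON() accepts a JSON string, a file path, or a URL
books_json_demo <- fromJSON(json_text)

# Show the column types; authors holds one character vector per book
str(books_json_demo)
```

Because the authors arrays have different lengths, jsonlite keeps them as a list column rather than flattening them into a single string.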
Comparison
To check whether the two data frames are identical, I first used the identical function.
The identical function returned FALSE, meaning the two data frames are not identical, so I used the all.equal function to get additional information on the differences.
[1] "Names: 6 string mismatches"
[2] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
[3] "Attributes: < Component \"class\": 1 string mismatch >"
[4] "Component 2: Modes: character, list"
[5] "Component 2: target is character, current is list"
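The two-step check described above can be sketched with base R alone. This is an illustration under assumptions, not the code that produced the output above: html_like and json_like are small hand-built stand-ins for the two data frames, and the sketch uses base R's identical() and all.equal().

```r
# Hand-built stand-in for the HTML data frame: capitalized names, authors as one string
html_like <- data.frame(
  Title = c("In Pursuit of the Perfect Portfolio", "The Super Traders"),
  Authors = c("Andrew W. Lo, Stephen R. Foerster", "Alan Rubenfeld")
)

# Hand-built stand-in for the JSON data frame: lowercase names, authors as a list column
json_like <- data.frame(
  title = c("In Pursuit of the Perfect Portfolio", "The Super Traders")
)
json_like$authors <- list(c("Andrew W. Lo", "Stephen R. Foerster"), "Alan Rubenfeld")

# Step 1: strict yes/no check
same <- identical(html_like, json_like)
print(same)

# Step 2: all.equal() returns TRUE or a character vector describing the differences
diffs <- all.equal(html_like, json_like)
print(diffs)
```

Since all.equal() does not return FALSE when the objects differ, it is usually wrapped in isTRUE() when a logical result is needed.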
Looking at the comparison result, there are 6 string mismatches in the column names, so I used the compare function from another library, waldo, to show the differences between the two data frames visually.
`class(old)`: "tbl_df" "tbl" "data.frame"
`class(new)`: "data.frame"
`names(old)`: "Title" "Authors" "Year" "Publisher" "Language" "Page Count"
`names(new)`: "title" "authors" "year" "publisher" "language" "page_count"
`old$Title` is a character vector ('In Pursuit of the Perfect Portfolio', 'Market Wizards: Interviews with Top Traders', 'The Super Traders')
`new$Title` is absent
`old$Authors` is a character vector ('Andrew W. Lo, Stephen R. Foerster', 'Jack D. Schwager', 'Alan Rubenfeld')
`new$Authors` is absent
`old$Year` is an integer vector (2023, 2016, 1995)
`new$Year` is absent
`old$Publisher` is a character vector ('Princeton University Press', 'Wiley', 'Probus Professional Pub.')
`new$Publisher` is absent
`old$Language` is a character vector ('English', 'English', 'English')
`new$Language` is absent
`old$Page Count` is an integer vector (416, 480, 259)
`new$Page Count` is absent
`old$title` is absent
`new$title` is a character vector ('In Pursuit of the Perfect Portfolio', 'Market Wizards: Interviews with Top Traders', 'The Super Traders')
`old$authors` is absent
`new$authors` is a list
And 4 more differences ...
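A small sketch of a waldo comparison follows, assuming the waldo package is installed. The objects old and new are hand-built one-row stand-ins, not the real data frames. The max_diffs argument is used here on the assumption that the truncated "more differences" line comes from waldo's default limit on how many differences are printed.

```r
library(waldo)

# One-row stand-ins: capitalized names and a string vs lowercase names and a list
old <- data.frame(Title = "The Super Traders", Authors = "Alan Rubenfeld")
new <- data.frame(title = "The Super Traders")
new$authors <- list("Alan Rubenfeld")

# max_diffs = Inf asks waldo to report every difference instead of truncating
result <- compare(old, new, max_diffs = Inf)
print(result)
```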
Looking at the results, there are some format differences between the two file types. The HTML data frame uses Title as a column name while the JSON data frame uses title. The authors in the HTML data frame are stored in a character vector, while the authors in the JSON data frame are stored in a list. The HTML data frame is also a tibble, while the JSON data frame is a plain data.frame. The information in the two data frames is similar, but the data frames themselves are not identical.
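One way to see that the differences are about format rather than content is to reshape one data frame to match the other and compare again. The sketch below is an assumption about how that could be done in base R, not a step from the original analysis; html_like, json_like, and json_norm are hypothetical hand-built stand-ins.

```r
# Stand-in for the HTML data frame (check.names = FALSE keeps the space in "Page Count")
html_like <- data.frame(
  Title = c("In Pursuit of the Perfect Portfolio", "The Super Traders"),
  Authors = c("Andrew W. Lo, Stephen R. Foerster", "Alan Rubenfeld"),
  `Page Count` = c(416L, 259L),
  check.names = FALSE
)

# Stand-in for the JSON data frame: lowercase names and a list column of authors
json_like <- data.frame(
  title = c("In Pursuit of the Perfect Portfolio", "The Super Traders"),
  page_count = c(416L, 259L)
)
json_like$authors <- list(c("Andrew W. Lo", "Stephen R. Foerster"), "Alan Rubenfeld")

# Rename the columns and collapse each author vector into one comma-separated string
json_norm <- data.frame(
  Title = json_like$title,
  Authors = vapply(json_like$authors, paste, character(1), collapse = ", "),
  `Page Count` = json_like$page_count,
  check.names = FALSE
)

# After normalizing names and types, the comparison no longer reports differences
print(all.equal(html_like, json_norm))
```

With the real data frames, the tibble class of the HTML result would also need to be reconciled, for example by converting it with as.data.frame() before comparing.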
Conclusion
Using the various comparisons above, we can see that the two data frames are not identical even though they hold information that looks similar. This is partly due to the different formatting conventions used in each file. In the JSON file, the keys are lowercase and use underscores between words (ex. title, page_count), while in the HTML table the headers are capitalized and use spaces (ex. Title, Page Count). Another difference is how the multiple authors are handled for the first book (In Pursuit of the Perfect Portfolio). In JSON, the two authors are contained in an array. In HTML, the group of authors is stored in a single string, separated by a comma. All in all, this is an interesting assignment showing how you have to take into account different formats for different file types.