Assignment 7

Author

Long Lin

Overview

For this assignment, I chose three books on investing:


In Pursuit of the Perfect Portfolio By Andrew W. Lo and Stephen R. Foerster

Market Wizards: Interviews with Top Traders By Jack D. Schwager

The Super Traders By Alan Rubenfeld


First, I created the HTML and JSON files containing the books' information by hand. Then I uploaded both files to my GitHub repository.

HTML file source: https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%207/books.html

JSON file source: https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%207/books.json

HTML File

For the HTML file, I used the rvest library and its read_html function to fetch the page from the source URL. I then selected the table element within the HTML content and parsed it into the books_html_df data frame.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rvest)

Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding
htmlsource <- "https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%207/books.html"
html_content <- read_html(htmlsource)

# Select the table and convert it
# 'header = TRUE' ensures the <th> tags become your column names
books_html_df <- html_content %>%
  html_element("table") %>%
  html_table(header = TRUE)

print(books_html_df)
# A tibble: 3 × 6
  Title                            Authors  Year Publisher Language `Page Count`
  <chr>                            <chr>   <int> <chr>     <chr>           <int>
1 In Pursuit of the Perfect Portf… Andrew…  2023 Princeto… English           416
2 Market Wizards: Interviews with… Jack D…  2016 Wiley     English           480
3 The Super Traders                Alan R…  1995 Probus P… English           259

JSON File

For the JSON file, I used the jsonlite library and its fromJSON function to read the file from the source URL and parse it into the books_json_df data frame.

library(jsonlite)

Attaching package: 'jsonlite'
The following object is masked from 'package:purrr':

    flatten
jsonsource <- "https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%207/books.json"
books_json_df <- fromJSON(jsonsource)

print(books_json_df)
                                        title                           authors
1         In Pursuit of the Perfect Portfolio Andrew W. Lo, Stephen R. Foerster
2 Market Wizards: Interviews with Top Traders                  Jack D. Schwager
3                           The Super Traders                    Alan Rubenfeld
  year                  publisher language page_count
1 2023 Princeton University Press  English        416
2 2016                      Wiley  English        480
3 1995   Probus Professional Pub.  English        259
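
The printed output hides one detail: when a JSON field holds an array, fromJSON can return that column as a list rather than a character vector. The sketch below illustrates this with a hand-built stand-in, so it runs without a network connection or jsonlite. The name json_like is hypothetical, and the assumption that each book's authors are stored as a JSON array is mine.

```r
# Hypothetical stand-in for books_json_df, built by hand so the example runs offline.
json_like <- data.frame(
  title = c("In Pursuit of the Perfect Portfolio",
            "Market Wizards: Interviews with Top Traders",
            "The Super Traders"),
  stringsAsFactors = FALSE
)

# Assigning a list with one element per row creates a list-column.
# Assumption: the JSON file stores each book's authors as an array.
json_like$authors <- list(
  c("Andrew W. Lo", "Stephen R. Foerster"),
  "Jack D. Schwager",
  "Alan Rubenfeld"
)

is.list(json_like$authors)   # check whether the column is a list
lengths(json_like$authors)   # number of authors stored for each book
json_like$authors[[1]]       # the individual authors of the first book
```

On the real data frame, the same three calls can be run against books_json_df$authors.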

Comparison

To check whether the two data frames are identical, I first used the identical function and stored the result in isIdentical.

isIdentical <- identical(books_html_df, books_json_df)

isIdentical
[1] FALSE

The identical function returned FALSE, meaning the two data frames are not identical, so I used the all.equal function to get additional information on the differences.

comparison <- all.equal(books_html_df, books_json_df)
print(comparison)
[1] "Names: 6 string mismatches"                                                            
[2] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
[3] "Attributes: < Component \"class\": 1 string mismatch >"                                
[4] "Component 2: Modes: character, list"                                                   
[5] "Component 2: target is character, current is list"                                     

The all.equal result reports 6 string mismatches in the names, one for each column, along with differences in class and in the type of the second column. To see these differences in more detail, I used the compare function from the waldo library.

library(waldo)

compare(books_html_df, books_json_df)
`class(old)`: "tbl_df" "tbl" "data.frame"
`class(new)`:                "data.frame"

`names(old)`: "Title" "Authors" "Year" "Publisher" "Language" "Page Count"
`names(new)`: "title" "authors" "year" "publisher" "language" "page_count"

`old$Title` is a character vector ('In Pursuit of the Perfect Portfolio', 'Market Wizards: Interviews with Top Traders', 'The Super Traders')
`new$Title` is absent

`old$Authors` is a character vector ('Andrew W. Lo, Stephen R. Foerster', 'Jack D. Schwager', 'Alan Rubenfeld')
`new$Authors` is absent

`old$Year` is an integer vector (2023, 2016, 1995)
`new$Year` is absent

`old$Publisher` is a character vector ('Princeton University Press', 'Wiley', 'Probus Professional Pub.')
`new$Publisher` is absent

`old$Language` is a character vector ('English', 'English', 'English')
`new$Language` is absent

`old$Page Count` is an integer vector (416, 480, 259)
`new$Page Count` is absent

`old$title` is absent
`new$title` is a character vector ('In Pursuit of the Perfect Portfolio', 'Market Wizards: Interviews with Top Traders', 'The Super Traders')

`old$authors` is absent
`new$authors` is a list

And 4 more differences ...

Looking at the results, there are three format differences between the two data frames. The HTML data frame uses capitalized column names such as Title, while the JSON data frame uses lowercase names such as title. The HTML data frame is a tibble, while the JSON data frame is a plain data.frame. The authors in the HTML data frame are stored in a character vector, while the authors in the JSON data frame are stored in a list. The two data frames hold the same information but are not identical.
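
As a follow-up, here is a minimal sketch of how the two data frames could be reconciled so that they compare as equal. It is not part of the original analysis. The names html_df and json_df are hypothetical stand-ins built from the printed values above so the code runs offline, and the three normalization steps are one possible approach among several.

```r
# Hypothetical stand-ins for books_html_df and books_json_df.
html_df <- data.frame(
  Title = c("In Pursuit of the Perfect Portfolio",
            "Market Wizards: Interviews with Top Traders",
            "The Super Traders"),
  Authors = c("Andrew W. Lo, Stephen R. Foerster",
              "Jack D. Schwager",
              "Alan Rubenfeld"),
  Year = c(2023L, 2016L, 1995L),
  Publisher = c("Princeton University Press", "Wiley",
                "Probus Professional Pub."),
  Language = "English",
  `Page Count` = c(416L, 480L, 259L),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

json_df <- data.frame(
  title = html_df$Title,
  year = html_df$Year,
  publisher = html_df$Publisher,
  language = html_df$Language,
  page_count = html_df$`Page Count`,
  stringsAsFactors = FALSE
)
# Assumption: authors arrive as a list-column, as reported by waldo.
json_df$authors <- list(c("Andrew W. Lo", "Stephen R. Foerster"),
                        "Jack D. Schwager", "Alan Rubenfeld")
json_df <- json_df[, c("title", "authors", "year",
                       "publisher", "language", "page_count")]

# Step 1: drop any tibble classes (a no-op for this stand-in, which is
# already a plain data frame; it matters for the real books_html_df).
html_df <- as.data.frame(html_df)

# Step 2: lowercase the names and turn spaces into underscores,
# so "Page Count" becomes "page_count".
names(html_df) <- gsub("[^a-z0-9]+", "_", tolower(names(html_df)))

# Step 3: collapse each author vector into one comma-separated string.
json_df$authors <- vapply(json_df$authors, paste, character(1),
                          collapse = ", ")

all.equal(html_df, json_df)
```

With these stand-ins, all.equal should return TRUE after the three steps. Collapsing the list into a string is only one choice. Splitting the HTML strings into a list would work as well and would keep the authors separate.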

Conclusion

Using the comparisons above, we can see that the two data frames are not identical even though they hold the same information. This is partly due to the different formatting conventions used in each file. In the JSON file, the keys are lowercase with words joined by an underscore (e.g., title, page_count), which is snake case rather than camel case. In the HTML table, the headers are capitalized with words separated by a space (e.g., Title, Page Count). Another difference is how the multiple authors are handled for the first book (In Pursuit of the Perfect Portfolio). In JSON, the two authors are contained in an array, which fromJSON parses into a list. In HTML, the authors are in a single string separated by a comma. A third difference comes from the parsing libraries themselves: html_table returns a tibble, while fromJSON returns a plain data.frame. All in all, this was an interesting assignment that shows how the format of each file type has to be taken into account.