Data 607 HTML & JSON

Assignment Overview

For this assignment, I created a small dataset of three books related to uncertainty, prediction, and decision-making. The books I picked are:

- The Black Swan: The Impact of the Highly Improbable by Nassim Nicholas Taleb

- Superforecasting: The Art and Science of Prediction by Philip E. Tetlock and Dan Gardner

- Thinking, Fast and Slow by Daniel Kahneman

These books all explore how humans understand randomness, risk, probability, and forecasting.

For each book, I recorded the following:

- Title

- Authors

- Publication Year

- Publisher

- Genre

I manually created the dataset in two formats:

1. HTML using a table structure

2. JSON using key-value pairs
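The hand-written files themselves are not reproduced in this write-up. As a rough sketch of the structure the loading code expects, the snippet below builds the same three records in R and prints one possible JSON and HTML representation. The top-level `books` key is inferred from the `json_raw$books` call used later; the exact markup of the original files, and the names `books`, `json_text`, and `html_text`, are assumptions made for illustration. Generating the text from a data frame is only a way to show the layout, not how the files were actually made.

```r
library(jsonlite)

# The three records described above, as a plain data frame.
books <- data.frame(
  Title = c(
    "The Black Swan: The Impact of the Highly Improbable",
    "Superforecasting: The Art and Science of Prediction",
    "Thinking, Fast and Slow"
  ),
  Authors = c(
    "Nassim Nicholas Taleb",
    "Philip E. Tetlock, Dan Gardner",
    "Daniel Kahneman"
  ),
  `Publication Year` = c(2007L, 2015L, 2011L),
  Publisher = c("Random House", "Crown Publishers", "Farrar, Straus and Giroux"),
  Genre = "Nonfiction",
  check.names = FALSE,        # keep the space in "Publication Year"
  stringsAsFactors = FALSE
)

# JSON: a top-level "books" key (assumed) holding one object per book.
json_text <- toJSON(list(books = books), pretty = TRUE)

# HTML: one header row of <th> cells, then one <tr> of <td> cells per book.
# No HTML escaping is done here; these values contain no special characters.
header <- paste0("<th>", names(books), "</th>", collapse = "")
rows <- vapply(seq_len(nrow(books)), function(i) {
  cells <- vapply(books, function(col) as.character(col[i]), character(1))
  paste0("<td>", cells, "</td>", collapse = "")
}, character(1))
html_text <- paste0(
  "<table>\n<tr>", header, "</tr>\n",
  paste0("<tr>", rows, "</tr>", collapse = "\n"),
  "\n</table>"
)

cat(json_text, "\n")
cat(html_text, "\n")
```

Storing the year as an unquoted number in the JSON and as plain digits in the HTML cell lets both parsers read it as an integer, which matters for the comparison step.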

Loading Libraries

library(rvest)

library(jsonlite)

Loading HTML data

html_page <- read_html("books.html")

html_df <- html_page |>
  html_element("table") |>
  html_table()

html_df
# A tibble: 3 × 5
  Title                               Authors `Publication Year` Publisher Genre
  <chr>                               <chr>                <int> <chr>     <chr>
1 The Black Swan: The Impact of the … Nassim…               2007 Random H… Nonf…
2 Superforecasting: The Art and Scie… Philip…               2015 Crown Pu… Nonf…
3 Thinking, Fast and Slow             Daniel…               2011 Farrar, … Nonf…

Loading JSON data

json_raw <- fromJSON("books.json")

json_df <- json_raw$books

json_df
                                                Title
1 The Black Swan: The Impact of the Highly Improbable
2 Superforecasting: The Art and Science of Prediction
3                             Thinking, Fast and Slow
                         Authors Publication Year                 Publisher
1          Nassim Nicholas Taleb             2007              Random House
2 Philip E. Tetlock, Dan Gardner             2015          Crown Publishers
3                Daniel Kahneman             2011 Farrar, Straus and Giroux
       Genre
1 Nonfiction
2 Nonfiction
3 Nonfiction

Comparing datasets

html_df <- as.data.frame(html_df)
json_df <- as.data.frame(json_df)

identical(html_df, json_df)
[1] TRUE
all.equal(html_df, json_df)
[1] TRUE

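Because html_table() returns a tibble and fromJSON() returns a base data frame, I first converted both objects to standard data frames so that the comparison would not fail on class alone. While testing, I also had to adjust the JSON file a few times so that its values and structure matched the HTML data exactly. With those corrections in place, both identical() and all.equal() returned TRUE, confirming that the two datasets match.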
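The kind of mismatch that required those adjustments can be sketched with a made-up one-row example. The data frames `a` and `b` below are hypothetical and are not the assignment data; they assume the year was stored as a quoted string in one source and as a number in the other, which is one plausible cause of a failed comparison rather than a record of what actually went wrong.

```r
# Two hypothetical versions of the same record: the year is an integer in one
# and a character string in the other (as when JSON stores "2011" in quotes).
a <- data.frame(Title = "Thinking, Fast and Slow", Year = 2011L,
                stringsAsFactors = FALSE)
b <- data.frame(Title = "Thinking, Fast and Slow", Year = "2011",
                stringsAsFactors = FALSE)

# identical() is strict: column types must match exactly.
match_before <- identical(a, b)
print(match_before)

# all.equal() returns a description of the differences instead of TRUE.
print(all.equal(a, b))

# Comparing column classes side by side locates the mismatched column.
print(rbind(a = sapply(a, class), b = sapply(b, class)))

# Coercing the column to a common type makes the two objects match.
b$Year <- as.integer(b$Year)
match_after <- identical(a, b)
print(match_after)
```

Because all.equal() returns a character vector rather than FALSE when objects differ, wrapping it in isTRUE() is the safer way to use it inside a condition.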