Data 607 HTML & JSON
Assignment Overview
For this assignment, I created a small dataset of three books related to uncertainty, prediction, and decision-making. The books I picked are:
- The Black Swan: The Impact of the Highly Improbable by Nassim Nicholas Taleb
- Superforecasting: The Art and Science of Prediction by Philip E. Tetlock and Dan Gardner
- Thinking, Fast and Slow by Daniel Kahneman
These books all explore how humans understand randomness, risk, probability, and forecasting.
For each book, I recorded the following:
- Title
- Authors
- Publication Year
- Publisher
- Genre
I manually created the dataset in two formats:
1. HTML using a table structure
2. JSON using key-value pairs
Loading Libraries
library(rvest)
library(jsonlite)
Loading HTML data
html_page <- read_html("books.html")
html_df <- html_page |>
  html_element("table") |>
  html_table()
html_df
# A tibble: 3 × 5
  Title                               Authors `Publication Year` Publisher Genre
  <chr>                               <chr>                <int> <chr>     <chr>
1 The Black Swan: The Impact of the … Nassim…               2007 Random H… Nonf…
2 Superforecasting: The Art and Scie… Philip…               2015 Crown Pu… Nonf…
3 Thinking, Fast and Slow             Daniel…               2011 Farrar, … Nonf…
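As a rough illustration of the table structure that `html_table()` reads, the sketch below parses an inline HTML string instead of `books.html`. The markup is an assumption about how the file is laid out (a single `<table>` whose first row uses `<th>` header cells), and only one of the three books is included. `minimal_html()` is rvest's helper for parsing an HTML string.

```r
library(rvest)

# Hypothetical stand-in for books.html: one header row and one data row.
html_text <- '
<table>
  <tr>
    <th>Title</th><th>Authors</th><th>Publication Year</th>
    <th>Publisher</th><th>Genre</th>
  </tr>
  <tr>
    <td>Thinking, Fast and Slow</td><td>Daniel Kahneman</td><td>2011</td>
    <td>Farrar, Straus and Giroux</td><td>Nonfiction</td>
  </tr>
</table>'

# Same pipeline as above, applied to the inline string.
sketch_html_df <- minimal_html(html_text) |>
  html_element("table") |>
  html_table()

# Inspect column names and types.
str(sketch_html_df)
```

By default `html_table()` converts numeric-looking columns, which is consistent with `Publication Year` appearing as `<int>` in the output above.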
Loading JSON data
json_raw <- fromJSON("books.json")
json_df <- json_raw$books
json_df
                                                Title
1 The Black Swan: The Impact of the Highly Improbable
2 Superforecasting: The Art and Science of Prediction
3                             Thinking, Fast and Slow
                         Authors Publication Year                 Publisher
1          Nassim Nicholas Taleb             2007              Random House
2 Philip E. Tetlock, Dan Gardner             2015          Crown Publishers
3                Daniel Kahneman             2011 Farrar, Straus and Giroux
       Genre
1 Nonfiction
2 Nonfiction
3 Nonfiction
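The `json_raw$books` access implies that the file has a top-level `books` key holding an array of objects, one per book. The sketch below parses an inline string of that shape; the exact contents of `books.json` are not shown here, so the text is an assumption, and only one record is included.

```r
library(jsonlite)

# Hypothetical stand-in for books.json: a "books" array with one record.
json_text <- '{
  "books": [
    {
      "Title": "Thinking, Fast and Slow",
      "Authors": "Daniel Kahneman",
      "Publication Year": 2011,
      "Publisher": "Farrar, Straus and Giroux",
      "Genre": "Nonfiction"
    }
  ]
}'

# fromJSON() simplifies an array of uniform objects into a data frame.
sketch_json_df <- fromJSON(json_text)$books

# Inspect column names and types.
str(sketch_json_df)
```

For the two sources to compare as equal, the JSON keys need to match the HTML column headers exactly, and the year should be an unquoted number. A quoted `"2011"` would be read as a character string rather than an integer.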
Comparing data sets
html_df <- as.data.frame(html_df)
json_df <- as.data.frame(json_df)
identical(html_df, json_df)
[1] TRUE
all.equal(html_df, json_df)
[1] TRUE
While testing the comparison, I had to adjust the JSON file a few times so that the values and structure matched the HTML data exactly. After converting both objects to standard data frames and making those corrections, the comparison confirmed that the two datasets were identical.
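When the comparison does not return `TRUE`, `identical()` gives no detail, while `all.equal()` returns a character vector describing the differences. The sketch below shows one way to locate a mismatch using base R only. The data frames `a` and `b` and the column name `Year` are hypothetical and are not the objects from this assignment.

```r
# Two hypothetical data frames that differ only in the type of one column.
a <- data.frame(Title = "Thinking, Fast and Slow", Year = 2011L)
b <- data.frame(Title = "Thinking, Fast and Slow", Year = "2011")

# identical() reports only that the objects differ.
same <- identical(a, b)
print(same)

# all.equal() returns a description of each difference instead of FALSE.
diff_report <- all.equal(a, b)
print(diff_report)

# Comparing column classes side by side shows where the types diverge.
class_table <- rbind(a = sapply(a, class), b = sapply(b, class))
print(class_table)

# Wrap all.equal() in isTRUE() when a single TRUE/FALSE is needed.
print(isTRUE(all.equal(a, b)))
```

A check like this points to the column that needs fixing in the source file, for example a year stored as text in one format and as a number in the other.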