Week 7 - HTML and JSON

Author

Kristoff Oliphant

Introduction

It is now week 7, and this week's assignment involves HTML and JSON. It is a warm-up exercise for getting familiar with the HTML and JSON file formats and for using packages that read these formats into R data frames for downstream use.

Planned Workflow

For my planned workflow, I selected three data science books, each with multiple authors. For each book, I recorded the title, the authors, and additional attributes: the publisher, the ISBN-13, and the publication year. Using this information, I manually created two files in TextEdit: an HTML file containing a table of the book data, saved with the .html extension, and a JSON file containing the same data, saved with the .json extension. I’ll load each file into its own R data frame, using rvest to read the HTML and jsonlite to read the JSON. I’ll then use the tidyverse (dplyr in particular) to bring both data frames into a consistent format so that they can be checked for an identical match.
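As a rough illustration of this workflow, the sketch below parses a small HTML table and a small JSON document from inline strings. The contents of the real books.html and books.json are not shown in this report, so the markup here is only an assumed shape, and the names html_txt, json_txt, html_demo, and json_demo are hypothetical.

```r
library(rvest)
library(jsonlite)

# Assumed shape of a hand-written HTML table (two books, three columns)
html_txt <- '<table>
  <tr><th>Title</th><th>Author</th><th>Year</th></tr>
  <tr><td>Deep Learning</td>
      <td>Ian Goodfellow, Yoshua Bengio, Aaron Courville</td><td>2016</td></tr>
  <tr><td>An Introduction to Statistical Learning</td>
      <td>Gareth James, Daniela Witten, Trevor Hastie, Rob Tibshirani</td><td>2021</td></tr>
</table>'

# Assumed shape of the JSON file, with authors stored as an array per book
json_txt <- '{"books": [
  {"Title": "Deep Learning",
   "Authors": ["Ian Goodfellow", "Yoshua Bengio", "Aaron Courville"],
   "Year": 2016},
  {"Title": "An Introduction to Statistical Learning",
   "Authors": ["Gareth James", "Daniela Witten", "Trevor Hastie", "Rob Tibshirani"],
   "Year": 2021}
]}'

# html_table() returns a list of tibbles, one per <table> element
html_demo <- html_table(minimal_html(html_txt))[[1]]

# fromJSON() simplifies the array of objects into a data frame
json_demo <- fromJSON(json_txt)$books

str(html_demo)
str(json_demo)
```

In this sketch the HTML version keeps the authors as a single string per book, while the JSON version keeps them as a list column, so some reshaping is needed before the two can be compared.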

Anticipated Challenges

One challenge I anticipate is making sure the syntax in both files is correct so that they load properly into R. I created the files by hand, so if I did not follow the rules of each format, my R code is likely to fail. Because the two formats use different syntax, there is also a fair chance that they load into R with different structures that do not match each other.
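One way to catch syntax mistakes in a hand-written JSON file before loading it is jsonlite's validate() function. The snippet below is a minimal sketch using made-up strings (good_json and bad_json are hypothetical and are not the contents of books.json); the second string has a trailing comma, a common hand-editing mistake.

```r
library(jsonlite)

# Hypothetical JSON snippets for illustration only
good_json <- '{"books": [{"Title": "Deep Learning", "Year": 2016}]}'
bad_json  <- '{"books": [{"Title": "Deep Learning", "Year": 2016,}]}'  # trailing comma

# validate() returns TRUE or FALSE; on failure the parser message
# is attached as the "err" attribute
good_ok <- validate(good_json)
bad_ok  <- validate(bad_json)

print(as.logical(good_ok))
print(as.logical(bad_ok))
print(attr(bad_ok, "err"))
```

For a file on disk, the text could be read first, for example with readLines(), and then passed to validate().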

library(rvest)
library(jsonlite)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks stats::filter()
✖ purrr::flatten()        masks jsonlite::flatten()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag()            masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

html_raw <- read_html("https://raw.githubusercontent.com/Kristoffgit/data607_week7/refs/heads/main/books.html")
html_df <- html_raw %>%
  html_table() %>%
  .[[1]]

print(html_df)

# A tibble: 3 × 5
  Title                                   Author       Publisher `ISBN-13`  Year
  <chr>                                   <chr>        <chr>     <chr>     <int>
1 R for Data Science (2e)                 Hadley Wick… O'Reilly… 978-1492…  2023
2 An Introduction to Statistical Learning Gareth Jame… Springer  978-1071…  2021
3 Deep Learning                           Ian Goodfel… MIT Press 978-0262…  2016

json_raw <- fromJSON("https://raw.githubusercontent.com/Kristoffgit/data607_week7/refs/heads/main/books.json")
json_df <- as.data.frame(json_raw$books)

print(json_df)

                                    Title
1                 R for Data Science (2e)
2                           Deep Learning
3 An Introduction to Statistical Learning
                                                      Authors      Publisher
1    Hadley Wickham, Mine Çetinkaya-Rundel, Garrett Grolemund O'Reilly Media
2              Ian Goodfellow, Yoshua Bengio, Aaron Courville      MIT Press
3 Gareth James, Daniela Witten, Trevor Hastie, Rob Tibshirani       Springer
         ISBN-13 Year
1 978-1492097402 2023
2 978-0262035613 2016
3 978-1071614174 2021

str(html_df)

tibble [3 × 5] (S3: tbl_df/tbl/data.frame)
 $ Title    : chr [1:3] "R for Data Science (2e)" "An Introduction to Statistical Learning" "Deep Learning"
 $ Author   : chr [1:3] "Hadley Wickham, Mine Cetinkaya-Rundel, Garrett Grolemund" "Gareth James, Daniela Witten, Trevor Hastie, Rob Tibshirani" "Ian Goodfellow, Yoshua Bengio, Aaron Courville"
 $ Publisher: chr [1:3] "O'Reilly Media" "Springer" "MIT Press"
 $ ISBN-13  : chr [1:3] "978-1492097402" "978-1071614174" "978-0262035613"
 $ Year     : int [1:3] 2023 2021 2016

str(json_df)

'data.frame':   3 obs. of  5 variables:
 $ Title    : chr  "R for Data Science (2e)" "Deep Learning" "An Introduction to Statistical Learning"
 $ Authors  :List of 3
  ..$ : chr  "Hadley Wickham" "Mine Çetinkaya-Rundel" "Garrett Grolemund"
  ..$ : chr  "Ian Goodfellow" "Yoshua Bengio" "Aaron Courville"
  ..$ : chr  "Gareth James" "Daniela Witten" "Trevor Hastie" "Rob Tibshirani"
 $ Publisher: chr  "O'Reilly Media" "MIT Press" "Springer"
 $ ISBN-13  : chr  "978-1492097402" "978-0262035613" "978-1071614174"
 $ Year     : int  2023 2016 2021

json_df <- json_df %>%
  mutate(Authors = sapply(Authors, paste, collapse = ", ")) %>%
  rename(Author = Authors)

html_df <- html_df %>% arrange(Title)
json_df <- json_df %>% arrange(Title)

is_identical <- identical(html_df, json_df)

print(is_identical)

[1] FALSE

if(!is_identical) {
  all.equal(html_df, json_df)
}

[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
[3] "Component \"Author\": 1 string mismatch"

Conclusion

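After comparing the two data frames, identical() returned FALSE. The all.equal() output shows two kinds of differences: object attributes and string content. The class mismatch arises because html_table() returns a tibble (classes tbl_df, tbl, and data.frame), while json_df is a plain data.frame. The string mismatch in the Author column arises because Çetinkaya-Rundel’s name is spelled with the special character Ç in the JSON file and with a regular C in the HTML file.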
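If the goal were to make the two data frames match, both differences could be reconciled in a few lines. The sketch below is one possible approach, not part of the assignment's results: it uses one-row stand-ins (html_demo and json_demo are hypothetical names) built from the values shown above, converts the plain data frame to a tibble, and replaces Ç with C. Correcting the spelling in the HTML file itself would be the cleaner fix.

```r
library(dplyr)
library(tibble)

# One-row stand-ins for html_df and json_df, using values from the output above
html_demo <- tibble(
  Title  = "R for Data Science (2e)",
  Author = "Hadley Wickham, Mine Cetinkaya-Rundel, Garrett Grolemund"
)
json_demo <- data.frame(
  Title  = "R for Data Science (2e)",
  Author = "Hadley Wickham, Mine \u00c7etinkaya-Rundel, Garrett Grolemund"
)

# Align the class (tibble) and the spelling (replace the C-cedilla with a plain C)
json_fixed <- json_demo %>%
  as_tibble() %>%
  mutate(Author = gsub("\u00c7", "C", Author, fixed = TRUE))

# Compare before and after the adjustments
equal_before <- isTRUE(all.equal(html_demo, json_demo))
equal_after  <- isTRUE(all.equal(html_demo, json_fixed))

print(equal_before)
print(equal_after)
```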