Project 7 approach

Author

Sam Barbaro

Approach

I’ve created HTML and JSON files and uploaded them to GitHub. These files feature information about books I edited. I will load them here and compare them to see if they are identical and, if not, where they differ.

books.html: https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/books.html

books.json: https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/books.json

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rvest)

Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding
library(jsonlite)

Attaching package: 'jsonlite'

The following object is masked from 'package:purrr':

    flatten
books_h <- read_html("https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/books.html")


#turn the first table on the page into a data frame
html_df <- books_h |>
  html_element("table") |>
  html_table()


#Is this a tibble (the tidyverse flavor of data frame)?
is_tibble(html_df)
[1] TRUE
#load json file and turn it into a data frame
books_j <- fromJSON("https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/books.json")

Testing whether the data are identical

These shouldn’t be 100% identical, since I separated co-authors with a semicolon in the HTML file but with a comma in the JSON file.
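To make the separator issue concrete, here is a minimal sketch using made-up author strings (not the actual contents of books.html or books.json). It assumes co-authors are joined with a semicolon in one version and a comma in the other, and it shows why the comma is the riskier choice when a name carries a credential such as "Ph.D".

```r
# Hypothetical author strings, for illustration only
html_authors <- c("Author A; Author B", "Author C, Ph.D")
json_authors <- c("Author A, Author B", "Author C, Ph.D")

# Splitting on ";" leaves "Author C, Ph.D" in one piece
html_split <- strsplit(html_authors, ";\\s*")

# Splitting on "," also breaks the credential away from the name
json_split <- strsplit(json_authors, ",\\s*")

lengths(html_split)  # 2 1
lengths(json_split)  # 2 2
```

The second comma-split row comes back as two pieces even though it describes a single author, which is the kind of value that has to be repaired by hand afterward.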

#are they identical? 
identical_data <- all.equal(books_j, html_df)

#not exactly, but they're close

identical_data
[1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
[3] "Component \"Title\": 1 string mismatch"                                                
[4] "Component \"Authors\": Modes: list, character"                                         
[5] "Component \"Authors\": Component 1: 1 string mismatch"                                 
#identical() gives a single TRUE or FALSE instead of a list of differences
identical(books_j, html_df)
[1] FALSE
#They are not identical

#Let's make them more alike by splitting the authors into separate columns

html_df_2 <- html_df |>
    separate(col = Authors, into = c("Author_1", "Author_2"), sep = ";") 
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [2, 3].
#I really should have used the semicolon as the separator in the JSON file too

books_j2 <- books_j |>
    separate(col = Authors, into = c("Author_1", "Author_2"), sep = ",") 
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [3].
#replacing individual values: the comma in "Angela Wood, Ph.D" was treated as a separator, so row 2 was split into two authors

books_j2[2, 2] <- "Angela Wood, Ph.D"

books_j2[2, 3] <- NA

#fixing some of the characters, which converted strangely
html_df_2[1, 1] <- "Threshold Concepts in Women’s and Gender Studies"
books_j2[1, 1] <- "Threshold Concepts in Women’s and Gender Studies"


identical_data_2 <- all.equal(books_j2, html_df_2)

identical_data_2
[1] "Attributes: < Component \"class\": Lengths (1, 3) differ (string compare on first 1) >"
[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
diffs <- setdiff(books_j2, html_df_2)
diffs
[1] Title            Author_1         Author_2         Publication_Year
[5] Publisher        ISBN-13         
<0 rows> (or 0-length row.names)
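#The contents now match (setdiff() returns 0 rows), but the objects still aren't identical:
#all.equal() reports only a class difference (books_j2 is a plain data frame, html_df_2 is a tibble)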

identical(books_j2, html_df_2)
[1] FALSE
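
The remaining mismatch reported above is in the class attribute only: `fromJSON()` gave me a plain data frame, while `html_table()` gave me a tibble. The sketch below reproduces that situation with toy tables (the titles and years are placeholders, not my book data) and shows one possible way to reconcile the two, by converting the plain data frame with `as_tibble()` before comparing. I have not run this against the real files.

```r
library(tibble)

# Toy stand-ins for the two tables (hypothetical values)
df_plain <- data.frame(
  Title = c("Book A", "Book B"),
  Publication_Year = c(2020L, 2021L)
)
df_tbl <- tibble(
  Title = c("Book A", "Book B"),
  Publication_Year = c(2020L, 2021L)
)

class(df_plain)  # "data.frame"
class(df_tbl)    # "tbl_df" "tbl" "data.frame"

# Same values, different class attribute, so identical() is FALSE
identical(df_plain, df_tbl)

# Once both objects are tibbles, all.equal() reports no differences
all.equal(as_tibble(df_plain), df_tbl)
```

Applied here, the analogous check would be `all.equal(as_tibble(books_j2), html_df_2)`.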

Google Gemini. (2026). Gemini 3 Flash [Large language model].
https://gemini.google.com. Accessed March 12, 2026.