I’ve created HTML and JSON files and uploaded them to GitHub. These files feature information about books I edited. I will load them here and compare them to see if they are identical and, if not, where they differ.
library(tidyverse)
library(rvest)
library(jsonlite)
books_h <- read_html("https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/books.html")
# turn into a data frame
html_df <- books_h |>
  html_table(fill = TRUE) |>
  pluck(1)
# Is this a data frame?
is_tibble(html_df)
[1] TRUE
# load json file and turn it into a data frame
books_j <- fromJSON("https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/books.json")
Testing whether the data are identical
These shouldn’t be 100% identical, since I separated multiple authors with a semicolon in the HTML file but with a comma in the JSON file.
# are they identical?
identical_data <- all.equal(books_j, html_df)
# not exactly, but they're close
identical_data
# A true or false function to check whether they're identical
identical(books_j, html_df)
[1] FALSE
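The difference between the two checks is easier to see on a small made-up example. This is only a minimal sketch using two hypothetical toy data frames (`toy_a` and `toy_b` are not the book files): `identical()` always returns a single TRUE or FALSE, while `all.equal()` returns either TRUE or a character vector describing what differs, so it is usually wrapped in `isTRUE()` when a single logical is needed.

```r
# Toy data (hypothetical, not the book files)
toy_a <- data.frame(Title = c("Book A", "Book B"), Year = c(2020, 2021))
toy_b <- data.frame(Title = c("Book A", "Book B"), Year = c(2020, 2022))

# identical() gives a single TRUE/FALSE
same_strict <- identical(toy_a, toy_b)

# all.equal() gives TRUE, or a character vector describing the differences
diff_report <- all.equal(toy_a, toy_b)

# isTRUE() turns the all.equal() result into a single logical
same_loose <- isTRUE(diff_report)

print(same_strict)
print(diff_report)
print(same_loose)
```

Here `diff_report` plays the same role as `identical_data` above: it is the place to look for a description of where the two objects disagree.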
# They are not identical
# Let's bring them closer together by splitting the authors into two columns
html_df_2 <- html_df |>
  separate(col = Authors, into = c("Author_1", "Author_2"), sep = ";")
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [2, 3].
# I really should have left the semicolon as a separator
books_j2 <- books_j |>
  separate(col = Authors, into = c("Author_1", "Author_2"), sep = ",")
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [3].
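The two `separate()` calls above hint at why a semicolon is the safer separator: a comma can also appear inside a single author's name or credentials. A minimal sketch on hypothetical toy data (the `toy` tibble and its values are made up for illustration) shows one way this might be handled. Because `sep` is interpreted as a regular expression, `";\\s*"` also swallows any spaces after the semicolon, and `fill = "right"` pads single-author rows with `NA` without the warning.

```r
library(tidyr)
library(tibble)

# Toy data (hypothetical): multiple authors joined by "; "
toy <- tibble(
  Title = c("Book A", "Book B"),
  Authors = c("Ann Lee; Raj Patel", "Angela Wood, Ph.D")
)

# sep is a regular expression: split on a semicolon plus any following spaces.
# fill = "right" fills the missing second author with NA and suppresses the warning.
toy_split <- toy |>
  separate(
    col = Authors,
    into = c("Author_1", "Author_2"),
    sep = ";\\s*",
    fill = "right"
  )

print(toy_split)
```

With a semicolon separator, the comma in "Angela Wood, Ph.D" stays inside `Author_1`, so no manual repair of that cell would be needed.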
# replacing individual values
books_j2[2, 2] <- "Angela Wood, Ph.D"
books_j2[2, 3] <- NA
# fixing some of the characters, which converted strangely
html_df_2[1, 1] <- "Threshold Concepts in Women’s and Gender Studies"
books_j2[1, 1] <- "Threshold Concepts in Women’s and Gender Studies"
identical_data_2 <- all.equal(books_j2, html_df_2)
identical_data_2
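When `all.equal()` still reports differences, it can help to find the exact cells that disagree before replacing values by hand. The following is a rough sketch on hypothetical toy data, assuming both data frames have the same dimensions and column order. It compares the two element by element and returns the row and column positions of the mismatches. Cells where either side is `NA` are skipped by this approach and would need a separate check.

```r
# Toy data (hypothetical), same shape and column order
toy_a <- data.frame(
  Title = c("Book A", "Book B"),
  Author_1 = c("Ann Lee", "Raj Patel")
)
toy_b <- data.frame(
  Title = c("Book A", "Book B"),
  Author_1 = c("Ann Lee", "R. Patel")
)

# Element-wise comparison gives a logical matrix;
# which(..., arr.ind = TRUE) returns the row and column of each mismatch
mismatch <- which(toy_a != toy_b, arr.ind = TRUE)

print(mismatch)
```

Each row of `mismatch` points at one cell, which could then be inspected or corrected with the same `[row, col]` indexing used above.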