Week 7

Author

ZIHAO YU

1. How will I tackle the problem?

I will first select three books on the same topic and record the same set of fields for each one. Because I am unfamiliar with HTML and JSON, I will create both files by hand and make sure the book data is consistent between them. I will also use an LLM to help me get familiar with the new material.

2. What data challenges do I anticipate?

Because these formats are new to me, I asked an LLM which data challenges I might encounter. The main ones are keeping the author information in a consistent format and making sure the data stays consistent when it is imported into R.
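As a rough illustration of the author-formatting challenge, the sketch below assumes that multiple authors are stored in one string separated by semicolons, which is the convention used in the book files for this assignment. The vector `authors` is a hypothetical sample, and `author_list` and `n_authors` are names introduced only for this example.

```r
# Hypothetical sample that mirrors a semicolon-separated authors column
authors <- c(
  "David S. Dummit; Richard M. Foote",
  "George Casella; Roger L. Berger",
  "Augustus De Morgan"
)

# Split each entry on ";" and trim the whitespace around every name
author_list <- lapply(strsplit(authors, ";", fixed = TRUE), trimws)

# Count how many authors each book has
n_authors <- lengths(author_list)

print(author_list)
print(n_authors)
```

Using one separator everywhere means the same split works on the data from either file; mixing "and", "&" and ";" would require extra cleaning.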

Books I selected: “Abstract Algebra, 3rd Edition; David S. Dummit and Richard M. Foote”.

“Statistical Inference, Second Edition; George Casella and Roger L. Berger”.

“Mathematics I; KESMIA Mounira”.


library(rvest)
library(jsonlite)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks stats::filter()
✖ purrr::flatten()        masks jsonlite::flatten()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag()            masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
html_df <- read_html("https://github.com/XxY-coder/data607-week7/raw/refs/heads/main/books.html")
html_df <- html_table(html_df, fill = TRUE)[[1]]

my_df <- as.data.frame(html_df)
my_df
                          title                           authors year
1 Abstract Algebra, 3rd Edition David S. Dummit; Richard M. Foote 2003
2         Statistical Inference   George Casella; Roger L. Berger 2002
3                 Mathematics I                Augustus De Morgan 2021
            publisher         ISBN
1               Wiley 9.780471e+12
2    Cengage Learning 9.780534e+12
3 Legare Street Press 9.781015e+12
json_df <- fromJSON("https://github.com/XxY-coder/data607-week7/raw/refs/heads/main/books.json")

json_df 
                          title                           authors year
1 Abstract Algebra, 3rd Edition David S. Dummit; Richard M. Foote 2003
2         Statistical Inference   George Casella; Roger L. Berger 2002
3                 Mathematics I                Augustus De Morgan 2021
            publisher          ISBN
1               Wiley 9780471433347
2    Cengage Learning 9780534243128
3 Legare Street Press 9781015083639

The two data frames are not identical: the ISBN column is read as a double from the HTML file but as a character string (chr) from the JSON file. Every other column matches.
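One way to locate this kind of mismatch is to compare the column classes side by side. The sketch below is only an illustration: `html_like` and `json_like` are hypothetical two-row stand-ins for the real data frames, built by hand from the values printed above.

```r
# Hypothetical miniature versions of the two data frames
html_like <- data.frame(
  title = c("Abstract Algebra, 3rd Edition", "Statistical Inference"),
  ISBN  = c(9780471433347, 9780534243128),      # numeric, as read from HTML
  stringsAsFactors = FALSE
)
json_like <- data.frame(
  title = c("Abstract Algebra, 3rd Edition", "Statistical Inference"),
  ISBN  = c("9780471433347", "9780534243128"),  # character, as read from JSON
  stringsAsFactors = FALSE
)

# Put the first class of every column from both data frames in one table
type_check <- data.frame(
  column = names(html_like),
  html   = vapply(html_like, function(x) class(x)[1], character(1)),
  json   = vapply(json_like, function(x) class(x)[1], character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
type_check$match <- type_check$html == type_check$json
print(type_check)

# Names of the columns whose classes differ
mismatched <- type_check$column[!type_check$match]
print(mismatched)
```

This assumes both data frames have the same columns in the same order; `identical()` alone only reports that something differs, not which column.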


# Convert ISBN to character so its type matches the JSON version
new_html <- my_df %>%
  mutate(ISBN = as.character(ISBN)) %>%
  select(title, authors, year, publisher, ISBN)

new_html
                          title                           authors year
1 Abstract Algebra, 3rd Edition David S. Dummit; Richard M. Foote 2003
2         Statistical Inference   George Casella; Roger L. Berger 2002
3                 Mathematics I                Augustus De Morgan 2021
            publisher          ISBN
1               Wiley 9780471433347
2    Cengage Learning 9780534243128
3 Legare Street Press 9781015083639
identical(new_html, json_df)
[1] TRUE
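Converting the numeric ISBN back to text works here because all three ISBNs are 13-digit numbers without leading zeros. An alternative, sketched below, is to stop `html_table()` from guessing column types in the first place. This assumes rvest 1.0.0 or later, where `html_table()` has a `convert` argument; the inline table is a hypothetical stand-in for books.html, and `page`, `raw_tbl` and `books` are names introduced only for this example.

```r
library(rvest)

# Hypothetical one-row table standing in for the real books.html
page <- minimal_html('
  <table>
    <tr><th>title</th><th>year</th><th>ISBN</th></tr>
    <tr><td>Statistical Inference</td><td>2002</td><td>9780534243128</td></tr>
  </table>
')

# convert = FALSE keeps every column as character, so the ISBN is never
# turned into a number
raw_tbl <- html_table(html_element(page, "table"), convert = FALSE)

# Convert only the columns that really are numeric
books <- as.data.frame(raw_tbl)
books$year <- as.integer(books$year)

str(books)
```

With this approach the ISBN stays a string from the start, which would also preserve identifiers that begin with zero.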