I will first select three books on the same topic and record the same fields for each one. Since I am unfamiliar with these formats, I will create the HTML and JSON files by hand and make sure the book data is consistent across both files. I will also use an LLM to help me get familiar with the new material.
2. What data challenges do I anticipate?
Since I am unfamiliar with this, I asked an LLM which data challenges I might encounter. These include keeping the author information formatted consistently and keeping the data consistent when it is imported into R.
Books I selected: “Abstract Algebra, 3rd Edition”, David S. Dummit and Richard M. Foote.
“Statistical Inference, Second Edition”, George Casella and Roger L. Berger.
“Mathematics I”, author: KESMIA Mounira.
library(rvest)
library(jsonlite)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 4.0.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ purrr::flatten() masks jsonlite::flatten()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
html_df <- read_html("https://github.com/XxY-coder/data607-week7/raw/refs/heads/main/books.html")
html_df <- html_table(html_df, fill = TRUE)[[1]]
my_df <- as.data.frame(html_df)
my_df
                          title                           authors year
1 Abstract Algebra, 3rd Edition David S. Dummit; Richard M. Foote 2003
2         Statistical Inference   George Casella; Roger L. Berger 2002
3                 Mathematics I                Augustus De Morgan 2021
            publisher         ISBN
1               Wiley 9.780471e+12
2    Cengage Learning 9.780534e+12
3 Legare Street Press 9.781015e+12
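The ISBN column above prints in scientific notation because html_table() converts numeric-looking text to numbers by default. Below is a minimal sketch of one way to keep the ISBNs as text at read time. It assumes rvest 1.0.0 or later, where the `convert` argument is available. The inline table is a hypothetical stand-in for books.html, not the actual file.

```r
library(rvest)

# Hypothetical two-row stand-in for books.html (the real markup may differ)
page <- minimal_html('
  <table>
    <tr><th>title</th><th>ISBN</th></tr>
    <tr><td>Abstract Algebra, 3rd Edition</td><td>9780471433347</td></tr>
    <tr><td>Statistical Inference</td><td>9780534243128</td></tr>
  </table>')

# Default behaviour: numeric-looking cells are converted, so ISBN becomes a double
books_num <- html_table(page)[[1]]

# convert = FALSE skips the type conversion, so every column stays character
books_chr <- html_table(page, convert = FALSE)[[1]]

class(books_num$ISBN)
class(books_chr$ISBN)
```

With `convert = FALSE` every column, including any year column, is returned as character, so numeric columns would need an explicit `as.integer()` afterwards.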
                          title                           authors year
1 Abstract Algebra, 3rd Edition David S. Dummit; Richard M. Foote 2003
2         Statistical Inference   George Casella; Roger L. Berger 2002
3                 Mathematics I                Augustus De Morgan 2021
            publisher          ISBN
1               Wiley 9780471433347
2    Cengage Learning 9780534243128
3 Legare Street Press 9781015083639
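A table with full-length ISBNs like the one above is what jsonlite::fromJSON() returns when the ISBNs are stored as JSON strings. The sketch below is illustrative only. The inline JSON is an assumed layout with values copied from the printed table, and the real books.json may be structured differently.

```r
library(jsonlite)

# Assumed JSON layout: an array of objects, with ISBN quoted as a string
json_text <- '[
  {"title": "Abstract Algebra, 3rd Edition",
   "authors": "David S. Dummit; Richard M. Foote",
   "year": 2003, "publisher": "Wiley", "ISBN": "9780471433347"},
  {"title": "Statistical Inference",
   "authors": "George Casella; Roger L. Berger",
   "year": 2002, "publisher": "Cengage Learning", "ISBN": "9780534243128"},
  {"title": "Mathematics I",
   "authors": "Augustus De Morgan",
   "year": 2021, "publisher": "Legare Street Press", "ISBN": "9781015083639"}
]'

# An array of flat objects is simplified to a data frame, one row per object
json_df <- fromJSON(json_text)

# Inspect column types: quoted values stay character, bare numbers become numeric
str(json_df)
```

Quoting the ISBN in the JSON source is what keeps it as character here. An unquoted 13-digit value would be parsed as a number instead.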
The two data frames are not identical because ISBN is read as a double from the HTML file but as a character string (chr) from the JSON file.
                          title                           authors year
1 Abstract Algebra, 3rd Edition David S. Dummit; Richard M. Foote 2003
2         Statistical Inference   George Casella; Roger L. Berger 2002
3                 Mathematics I                Augustus De Morgan 2021
            publisher          ISBN
1               Wiley 9780471433347
2    Cengage Learning 9780534243128
3 Legare Street Press 9781015083639
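One way to check the type mismatch and reconcile it is sketched below with two small hand-built data frames. `html_like` and `json_like` are hypothetical stand-ins for the imported tables, not the objects used above. Using sprintf() for the conversion is an assumption about a reasonable approach, not necessarily the method used in this report.

```r
# Stand-in for the HTML import: ISBN stored as a double
html_like <- data.frame(
  title = c("Abstract Algebra, 3rd Edition", "Statistical Inference", "Mathematics I"),
  ISBN  = c(9780471433347, 9780534243128, 9781015083639),
  stringsAsFactors = FALSE
)

# Stand-in for the JSON import: ISBN stored as character
json_like <- data.frame(
  title = c("Abstract Algebra, 3rd Edition", "Statistical Inference", "Mathematics I"),
  ISBN  = c("9780471433347", "9780534243128", "9781015083639"),
  stringsAsFactors = FALSE
)

# Compare column classes side by side to locate the mismatch
sapply(html_like, class)
sapply(json_like, class)
before <- identical(html_like, json_like)

# "%.0f" prints all digits with no decimals and no scientific notation
html_like$ISBN <- sprintf("%.0f", html_like$ISBN)
after <- identical(html_like, json_like)

c(before = before, after = after)
```

Storing ISBNs as character is usually the safer choice, since they are identifiers and not quantities, and a leading zero or a trailing "X" check digit would not survive as a number.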