It is now week 7, and this week's assignment involves HTML and JSON. It is a warm-up exercise to help us get familiar with the HTML and JSON file formats and with the packages that read these formats into R data frames for downstream use.
Planned Workflow
For my planned workflow, I selected three data science books, each with multiple authors. For each book, I recorded the title, the authors, and two to three additional attributes, such as the publication year, publisher, and ISBN. Using TextEdit, I manually created two files from this information: an HTML file containing a table of the book data, saved with the .html extension, and a JSON file containing the same data, saved with the .json extension. I'll load each file into its own R data frame, using rvest to read the HTML and jsonlite to read the JSON. I'll then use the tidyverse (including dplyr) to put the data into a readable, consistent format so that the two data frames can be compared directly.
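The loading step described above can be sketched as follows. This is a minimal illustration, not the code used in the report: the file contents, the temporary file paths, and the object names `books_html` and `books_json` are assumptions, and only one book is included to keep it short.

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-ins for the hand-written files (the real files are not shown).
html_path <- tempfile(fileext = ".html")
json_path <- tempfile(fileext = ".json")

writeLines(c(
  "<html><body><table>",
  "<tr><th>Title</th><th>Authors</th><th>Year</th></tr>",
  "<tr><td>Deep Learning</td>",
  "<td>Ian Goodfellow, Yoshua Bengio, Aaron Courville</td>",
  "<td>2016</td></tr>",
  "</table></body></html>"
), html_path)

writeLines(
  '[{"Title": "Deep Learning", "Authors": "Ian Goodfellow, Yoshua Bengio, Aaron Courville", "Year": 2016}]',
  json_path
)

# HTML: parse the page, pick the first <table>, and convert it to a data frame.
page       <- read_html(html_path)
books_html <- html_table(html_element(page, "table"))

# JSON: an array of objects is simplified into a data frame by default.
books_json <- fromJSON(json_path)

str(books_html)
str(books_json)
```

Calling `str()` on both results is a quick way to see whether the column names, column types, and object classes line up before attempting a stricter comparison.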
Anticipated Challenges
One challenge I anticipate is making sure the syntax in both files is correct so that they load into R properly. I created both files by hand, so if I didn't follow each format's rules exactly, my R code is likely to fail. Because the two formats use different syntax, there is also a fair chance that they will load into R in different ways and produce data frames that don't match each other.
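One way to catch a syntax slip in a hand-written JSON file before loading it is jsonlite's `validate()` function, which takes JSON text and reports whether it parses. The sketch below uses two made-up strings (`good_json` and `bad_json` are hypothetical, not the contents of the actual file); for a real file, the text could first be read with `readLines()` and collapsed into a single string.

```r
library(jsonlite)

# Hypothetical examples: one well-formed record, one with an unquoted key.
good_json <- '[{"Title": "Deep Learning", "Year": 2016}]'
bad_json  <- '[{Title: "Deep Learning", "Year": 2016}]'

good_ok <- validate(good_json)
bad_ok  <- validate(bad_json)

# validate() returns a logical; on failure the parser message is attached
# as the "err" attribute, which helps locate the problem.
if (!isTRUE(bad_ok)) {
  message("JSON problem: ", attr(bad_ok, "err"))
}
```

There is no equally strict check for HTML, since HTML parsers are generally forgiving of malformed markup; inspecting the parsed table is the more practical test there.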
library(rvest)
library(jsonlite)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ purrr::flatten() masks jsonlite::flatten()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# A tibble: 3 × 5
Title Author Publisher `ISBN-13` Year
<chr> <chr> <chr> <chr> <int>
1 R for Data Science (2e) Hadley Wick… O'Reilly… 978-1492… 2023
2 An Introduction to Statistical Learning Gareth Jame… Springer 978-1071… 2021
3 Deep Learning Ian Goodfel… MIT Press 978-0262… 2016
Title
1 R for Data Science (2e)
2 Deep Learning
3 An Introduction to Statistical Learning
Authors Publisher
1 Hadley Wickham, Mine Çetinkaya-Rundel, Garrett Grolemund O'Reilly Media
2 Ian Goodfellow, Yoshua Bengio, Aaron Courville MIT Press
3 Gareth James, Daniela Witten, Trevor Hastie, Rob Tibshirani Springer
ISBN-13 Year
1 978-1492097402 2023
2 978-0262035613 2016
3 978-1071614174 2021
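After comparing the two data frames, identical() returned FALSE. From what I can see, the data frames differ in their object attributes and in their string formatting. In the printed output above, the HTML result is a tibble with an Author column, while the JSON result is a plain data frame with an Authors column, and the second and third rows appear in a different order. The string mismatch comes from Çetinkaya-Rundel's name: the JSON file spells it with the special character Ç, while the HTML file uses a plain C.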
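Because identical() only returns a single TRUE or FALSE, it can help to narrow down where two data frames differ. The sketch below is one possible approach using small, hypothetical stand-ins for the two data frames (two books only, base data frames, names `books_html` and `books_json` assumed). It aligns the column names and row order first and then compares column by column.

```r
# Hypothetical miniature versions of the two data frames.
books_html <- data.frame(
  Title  = c("R for Data Science (2e)", "Deep Learning"),
  Author = c("Hadley Wickham, Mine Cetinkaya-Rundel, Garrett Grolemund",
             "Ian Goodfellow, Yoshua Bengio, Aaron Courville"),
  Year   = c(2023L, 2016L)
)
books_json <- data.frame(
  Title   = c("Deep Learning", "R for Data Science (2e)"),
  Authors = c("Ian Goodfellow, Yoshua Bengio, Aaron Courville",
              "Hadley Wickham, Mine \u00c7etinkaya-Rundel, Garrett Grolemund"),
  Year    = c(2016L, 2023L)
)

# Step 1: find column names that exist in only one of the two data frames.
only_in_html <- setdiff(names(books_html), names(books_json))

# Step 2: align the column name and the row order, then reset row names.
names(books_html)[names(books_html) == "Author"] <- "Authors"
books_html <- books_html[order(books_html$Title), ]
books_json <- books_json[order(books_json$Title), ]
rownames(books_html) <- NULL
rownames(books_json) <- NULL

# Step 3: compare column by column to see which ones still differ.
same_by_column <- mapply(identical, books_html, books_json)
print(same_by_column)

# Step 4: locate the rows where the author strings disagree.
mismatch_rows <- which(books_html$Authors != books_json$Authors)
print(books_json$Authors[mismatch_rows])
```

In this toy example only the Authors column still differs after alignment, which isolates the Ç versus C spelling as the remaining cause. With the real data, converting both objects to the same class (for example with `as.data.frame()`) would also be needed before identical() could return TRUE.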