Select three books on a single subject of your choice. At least one book must have multiple authors. For each book, record the title, authors, and two to three additional attributes of your choosing (e.g., publication year, publisher, ISBN, genre, rating).
Using this information, manually create two data files:
An HTML file containing a table with the book data (e.g., books.html)
A JSON file representing the same data (e.g., books.json)
Unless you are already very comfortable with these formats, you are expected to create both files by hand to better understand their structure.
Next, write R code (using packages of your choice) to:
Load the HTML data into an R data frame
Load the JSON data into a separate R data frame
Compare the two data frames and determine whether they are identical
Deliverables
books.html
books.json
R code that loads and compares both data sources
Package your complete solution in a Quarto (.qmd) file and publish it to rpubs.com.
Note: Publishing will require making your HTML and JSON files publicly accessible on the web.
Approach
I plan to purchase 3 CISSP-related books from Amazon. I will collect detailed information about these books and organize the data in Excel. From the Excel dataset, I will generate an HTML page and export the data as a CSV file. The CSV file will then be converted into JSON format.
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
week7url<-"https://github.com/dyc-sps/SPS_DATA607_Week7/raw/5c0c3b23194f0ad491f745999192b63842fa298e/3books.htm"htmlpage <-read_html(week7url)df_html <-html_table(htmlpage,header =TRUE) # after used identical function then add this optiondf_html <- df_html[[1]]df_html <- df_html[!apply(df_html, 1, function(x) all(is.na(x) | x =="")), ]
Title
1 The Official (ISC)2 CISSP CBK Reference (Cissp: Certified Information Systems Security Professional) 6th Edition
2 Destination CISSP: A Concise Guide
3 ISC2 CISSP Certified Information Systems Security Professional Official Study Guide & Practice Tests Bundle (Sybex Study Guide) 4th Edition
Author1 Author2 Author3 Author4
1 Arthur J. Deane Aaron Kraus <NA> <NA>
2 Rob Witcher John Berti Lou Hablas Nick Mitropoulos
3 Mike Chapple James Michael Stewart Darril Gibson David Seidl
ISBN-13 publisher publicaton year ISBN-10
1 978-1119789994 Sybex 2021 1119789990
2 979-8987407721 Destination Certification Inc. 2024 <NA>
3 978-1394258413 Sybex 2024 1394258410
Print length Edition Rank
1 672 pages 6th 91065
2 532 pages <NA> 18847
3 <NA> 4th 11112
identical(df_html,df_json)
[1] FALSE
After reviewing the HTML output, we found that the first row should be set as the header row, and the first column needs to be cleaned to remove the extra “” characters.
Title
1 The Official (ISC)2 CISSP CBK Reference (Cissp: Certified Information Systems Security Professional) 6th Edition
2 Destination CISSP: A Concise Guide
3 ISC2 CISSP Certified Information Systems Security Professional Official Study Guide & Practice Tests Bundle (Sybex Study Guide) 4th Edition
Author1 Author2 Author3 Author4
1 Arthur J. Deane Aaron Kraus <NA> <NA>
2 Rob Witcher John Berti Lou Hablas Nick Mitropoulos
3 Mike Chapple James Michael Stewart Darril Gibson David Seidl
ISBN-13 publisher publicaton year ISBN-10
1 978-1119789994 Sybex 2021 1119789990
2 979-8987407721 Destination Certification Inc. 2024 NA
3 978-1394258413 Sybex 2024 1394258410
Print length Edition Rank
1 672 pages 6th 91065
2 532 pages <NA> 18847
3 <NA> 4th 11112
Change all values to text and check if the data matches exactly using the identical() function.
After processing the HTML table, we successfully cleaned and prepared the data for analysis. The first row was converted to column headers, all columns were converted to text to ensure consistency, and unwanted line breaks () were removed. Any empty cells were replaced with NA, and fully empty rows were removed. Finally, a verification using the identical() function confirmed that the cleaned data exactly matches the expected format, ensuring the table is accurate and ready for further use.