week_7_html_and_json

Author

Brandon Chanderban

Published

March 12, 2026

Introduction/Approach

The objective of this Week 7 HTML and JSON assignment is to become more familiar with the structures of HTML and JSON data formats, and to demonstrate how both may be manually created and then imported into RStudio for further usage as data frames. Within the confines of this assignment, the same base dataset will be exhibited in two different file formats (one being an HTML table and the other being a JSON structure) and then loaded into RStudio for comparison.

For the purposes of this assignment, the chosen subject area will be books pertaining to R programming and data analysis. Three books will be selected, with at least one of them containing multiple authors, as set out within the assignment requirements. For each book, the recorded fields will include the title, author/s, and a number of additional attributes such as publication year, publisher, and ISBN.

Once the two source files have been manually constructed, they will be imported into RStudio using a package suited to each format. The imported objects will then be converted into data frames and compared in order to determine whether the HTML-derived and JSON-derived versions of the dataset match in structure and content.

Data Structure

The dataset to be constructed will contain three observations, each corresponding to one selected book. The variables to be included for each record will be those of:

  • title

  • authors

  • publication_year

  • publisher

  • isbn

The same information (pertaining to the variables above) will be represented in two ways:

  1. The HTML file, which will contain a table structure, with one row corresponding to a single book and one column mapping to each of the variables, and

  2. The JSON file, which will contain the same information in JSON format, likely as an array of book records, where each record is represented as an object with named key-value pairs.

Owing to the fact that one of the books must have multiple authors, special attention will need to be paid in ensuring that the authors field is represented consistently across the two formats.
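As a minimal sketch of the two layouts described above, the snippet below builds a one-record HTML table and a one-record JSON array as strings and parses each one. The markup is illustrative only (it is not a copy of the actual books.html or books.json files), and it assumes that the multi-author field is stored as a single semicolon-separated string in both formats.

```r
library(rvest)
library(jsonlite)

# Hypothetical one-record versions of the two source files (illustration only)
html_text <- '
<table>
  <tr><th>title</th><th>authors</th><th>publication_year</th></tr>
  <tr>
    <td>R for Data Science</td>
    <td>Hadley Wickham; Garrett Grolemund</td>
    <td>2017</td>
  </tr>
</table>'

json_text <- '[
  {
    "title": "R for Data Science",
    "authors": "Hadley Wickham; Garrett Grolemund",
    "publication_year": 2017
  }
]'

# Wrap the HTML fragment in a document and pull out its first (only) table
html_df <- html_table(minimal_html(html_text))[[1]]

# Parse the JSON array of objects into a data frame
json_df <- fromJSON(json_text)

names(html_df)
names(json_df)
```

Keeping the column headers in the HTML table identical to the JSON keys means that the two imports should share the same variable names without any renaming step.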

Proposed Plan

The analytical approach will likely follow the steps as outlined below.

Firstly, the three books on the identified subject of R programming and data analysis will be selected, ensuring that at least one includes multiple authors. The relevant book details will then be recorded in a consistent manner.

Subsequently, the dataset will be manually encoded into two source files using a plain-text editor on my local computer (for instance, Notepad): an HTML file containing a table of the book information, and a JSON file containing the same information in JSON syntax. These two files will then be saved as books.html and books.json, and uploaded to my public GitHub repository so that they may be accessed via public web links.

Once this has been done, both files will then be imported into RStudio using suitable packages, converted into data frames, and then compared to determine whether or not they bear identical structures and content.

Potential Challenges

The primary expected challenge relates to ensuring that the authors field is represented consistently across both formats, particularly in the case of the book with multiple authors. If the authors are stored differently in the source files, then this may lead to inconsistencies when the two data frames are compared. Such differences would be the result of conflicting raw data, rather than a consequence of how the two source formats are imported into RStudio.
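To make this risk concrete, the sketch below uses hypothetical records (not the actual source files) to show how jsonlite treats an authors field stored as a JSON array versus one stored as a delimited string, and how base R's strsplit() can recover the individual names from the delimited form.

```r
library(jsonlite)

# Hypothetical variant: authors stored as a JSON array
array_json <- '[
  {"title": "Book A", "authors": ["First Author", "Second Author"]},
  {"title": "Book B", "authors": ["Solo Author"]}
]'

# Hypothetical variant: authors stored as one semicolon-separated string
string_json <- '[
  {"title": "Book A", "authors": "First Author; Second Author"},
  {"title": "Book B", "authors": "Solo Author"}
]'

array_df  <- fromJSON(array_json)
string_df <- fromJSON(string_json)

# Arrays of unequal length come back as a list column ...
class(array_df$authors)
# ... whereas the delimited string stays a plain character column
class(string_df$authors)

# Splitting on ";" plus optional whitespace recovers the individual names
split_authors <- strsplit(string_df$authors, ";\\s*")
lengths(split_authors)
```

Since an HTML table cell can only hold text, storing the authors as a delimited string in both files is the simpler way to keep the two imports directly comparable.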

Prospective Books and Their Metadata

As mentioned prior, the selected subject area will be R programming and data analysis. Three books within this subject area have been chosen, and they satisfy the requirement of at least one containing multiple authors.

The three selected books are as follows:

  1. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data

    • Authors: Hadley Wickham and Garrett Grolemund

    • Publication Year: 2017

    • Publisher: O’Reilly Media

    • ISBN: 9781491910399

  2. Hands-On Programming with R

    • Author: Garrett Grolemund

    • Publication Year: 2014

    • Publisher: O’Reilly Media

    • ISBN: 9781449359072

  3. Advanced R (Second Edition)

    • Author: Hadley Wickham

    • Publication Year: 2019

    • Publisher: Chapman and Hall / CRC

    • ISBN: 9780367255374

Collectively, these books cover different levels of, and perspectives on, the usage of R, and their information will be manually encoded into both source files for later import and comparison in R.

Code Base/Body

The first step of this Week 7 assignment was the creation of the HTML and JSON source files containing the metadata of the three selected R programming and data analysis books.

These files were created within the basic Notepad local application, saved under the required names (books.html and books.json), and then pushed to my personal GitHub repository for downstream referencing.

The second step is to load the libraries required for reading in the HTML and JSON source files.

Code
library(tidyverse)
library(rvest)
library(jsonlite)

Import the HTML file

Now that the HTML file has been manually created and uploaded to my personal GitHub repository, it may be read into RStudio by way of the rvest package. Since the HTML file contains a table structure, the information housed within the table can be extracted and converted into the desired R data frame.

Code
html_url <- "https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/week_7_assignment/books.html"

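# Read the page and parse every <table> element into a list of tibbles.
# Note: the `fill` argument is deprecated in rvest >= 1.0.0 (missing cells
# are always filled with NA), so it is omitted here.
books_html <- html_url %>%
  read_html() %>%
  html_table()

# Take the first table from books_html and save it as a data frame
books_html_df <- books_html[[1]]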

glimpse(books_html_df)
Rows: 3
Columns: 5
$ title            <chr> "R for Data Science: Import, Tidy, Transform, Visuali…
$ authors          <chr> "Hadley Wickham; Garrett Grolemund", "Garrett Grolemu…
$ publication_year <int> 2017, 2014, 2019
$ publisher        <chr> "O'Reilly Media", "O'Reilly Media", "Chapman and Hall…
$ isbn             <dbl> 9.781492e+12, 9.781449e+12, 9.780367e+12
Code
books_html_df
# A tibble: 3 × 5
  title                               authors publication_year publisher    isbn
  <chr>                               <chr>              <int> <chr>       <dbl>
1 R for Data Science: Import, Tidy, … Hadley…             2017 O'Reilly… 9.78e12
2 Hands-On Programming with R         Garret…             2014 O'Reilly… 9.78e12
3 Advanced R (Second Edition)         Hadley…             2019 Chapman … 9.78e12
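
The isbn column above came back as a double because html_table() guesses column types by default. One hedged alternative, assuming rvest 1.0.0 or later (where the convert argument is available), is to switch off type conversion so that identifiers stay as text at import time. The table below is a hypothetical one-row stand-in for books.html, not the actual file.

```r
library(rvest)

# Hypothetical one-row table standing in for books.html
page <- minimal_html('
<table>
  <tr><th>title</th><th>publication_year</th><th>isbn</th></tr>
  <tr><td>Advanced R (Second Edition)</td><td>2019</td><td>9780367255374</td></tr>
</table>')

# Default behaviour: cell text is run through type conversion
guessed <- html_table(html_element(page, "table"))

# convert = FALSE keeps every column as character
as_text <- html_table(html_element(page, "table"), convert = FALSE)

sapply(guessed, class)
sapply(as_text, class)
```

The trade-off is that publication_year would then also arrive as text and would need an explicit conversion (for instance, with as.integer()).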

Import the JSON file

Our next step is to import the JSON file by way of the jsonlite package. Since the file was created as an array of three book objects that share the same keys, it can be read directly into R as a data frame.

Code
json_url <- "https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/week_7_assignment/books.json"

books_json_df <- fromJSON(json_url)

glimpse(books_json_df)
Rows: 3
Columns: 5
$ title            <chr> "R for Data Science: Import, Tidy, Transform, Visuali…
$ authors          <chr> "Hadley Wickham; Garrett Grolemund", "Garrett Grolemu…
$ publication_year <int> 2017, 2014, 2019
$ publisher        <chr> "O'Reilly Media", "O'Reilly Media", "Chapman and Hall…
$ isbn             <chr> "9781491910399", "9781449359072", "9780367255374"
Code
books_json_df
                                                                   title
1 R for Data Science: Import, Tidy, Transform, Visualize, and Model Data
2                                            Hands-On Programming with R
3                                            Advanced R (Second Edition)
                            authors publication_year              publisher
1 Hadley Wickham; Garrett Grolemund             2017         O'Reilly Media
2                 Garrett Grolemund             2014         O'Reilly Media
3                    Hadley Wickham             2019 Chapman and Hall / CRC
           isbn
1 9781491910399
2 9781449359072
3 9780367255374

Initial Comparison of Our Two Data Frames

At our current stage, both the HTML and JSON source files have been imported into RStudio and stored as two different data frames. This first comparison will be executed before any additional standardization is performed, in order to determine whether or not the two created data frames bear exact resemblance to one another based solely on their imported versions.

Code
identical(books_html_df, books_json_df)
[1] FALSE
Code
all.equal(books_html_df, books_json_df)
[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
[3] "Component \"isbn\": Modes: numeric, character"                                         
[4] "Component \"isbn\": target is numeric, current is character"                           

The initial comparisons above indicate that the two imported data frames are not strictly identical, as evidenced by the FALSE result returned by the identical() function. The all.equal() output further clarifies that the differences arise from the data frames' class attributes and data types, rather than from differences within the actual book records themselves.

In particular, the isbn variable was interpreted as numeric in the HTML-derived data frame and as character in the JSON-derived one. The class attributes also differ: html_table() returns a tibble (three classes), whereas fromJSON() returns a base data frame (one class). Discrepancies of this kind are fairly typical when importing data from differing file formats, and they can be resolved through the subsequent standardization of the variable types.
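A column-by-column class check can make such mismatches easier to spot than the all.equal() messages alone. The sketch below uses two small stand-in data frames (hypothetical objects that only mirror the column types reported by glimpse() above) and base R only.

```r
# Stand-in data frames mirroring the column types reported by glimpse() above
html_like <- data.frame(
  publication_year = c(2017L, 2014L, 2019L),
  isbn = c(9781491910399, 9781449359072, 9780367255374)
)
json_like <- data.frame(
  publication_year = c(2017L, 2014L, 2019L),
  isbn = c("9781491910399", "9781449359072", "9780367255374")
)

# Tabulate the (first) class of every column in each data frame
class_check <- data.frame(
  column = names(html_like),
  html   = vapply(html_like, function(x) class(x)[1], character(1)),
  json   = vapply(json_like, function(x) class(x)[1], character(1)),
  row.names = NULL
)
class_check$same <- class_check$html == class_check$json

class_check
```

Any row where the same column is FALSE points to a variable that needs to be standardized before the two data frames can compare as identical.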

Standardizing the Data Types

In order to ensure that both imported data frames bear the same structure, the relevant fields can be standardized. In this case, the publication_year variable will be converted to numeric in both data frames, whilst the isbn variable will be set as a character variable (also in both data frames).

Code
books_html_df <- books_html_df %>%
  mutate(
    publication_year = as.numeric(publication_year),
    isbn = as.character(isbn)
  ) %>%
  as_tibble()

books_json_df <- books_json_df %>%
  mutate(
    publication_year = as.numeric(publication_year),
    isbn = as.character(isbn)
  ) %>%
  as_tibble()

Once the variable data types have been standardized, the only remaining difference between the two imported objects pertains to their class attributes rather than to their actual contents. For this reason, the same pipeline that converts the variables also converts both objects into the same data frame class (via the as_tibble() function).
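Converting a numeric ISBN back to text works here because all three values are 13-digit numbers without leading zeros. The base R sketch below illustrates why this is not guaranteed in general; the second value is an illustrative identifier with a leading zero and is not one of the three selected books.

```r
# A 13-digit ISBN that was parsed as a double converts back to text cleanly
isbn_num <- 9781491910399
as.character(isbn_num)

# An identifier with a leading zero does not survive a numeric round trip
# (illustrative value, not one of the three books)
isbn10 <- "0306406152"
round_trip <- as.character(as.numeric(isbn10))
round_trip
```

For identifiers in general, keeping the column as character from the moment of import is therefore the safer choice.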

Final Comparison (After Standardization)

Now that the two data frames have been standardized, we can compare them once more in order to determine whether they are identical in both structure and content.

Code
identical(books_html_df, books_json_df)
[1] TRUE
Code
all.equal(books_html_df, books_json_df)
[1] TRUE

Following the standardization of both the variable data types and the data frame classes, the final comparison indicates that the two imported data frames are now identical in both structure and content. This is evidenced by the TRUE outputs returned by both the identical() and all.equal() functions.

Conclusion

In completing this Week 7 HTML and JSON assignment, the same small dataset of books related to R programming and data analysis was manually constructed in two different source formats, namely HTML and JSON, and then imported into RStudio for comparison. Though the initial imported data frames exhibited slight differences in data types and class attributes, these inconsistencies were resolved through standardization, after which the two data frames were shown to be fully identical. Overall, the exercise demonstrated that while HTML and JSON differ in syntax and structural representation, both may still encode the same underlying information and be converted into equivalent data frames for downstream analytical use within R.

References

  • Grolemund, G. (2014). Hands-on programming with R. O’Reilly Media.

  • Wickham, H. (2019). Advanced R (2nd ed.). Chapman and Hall/CRC.

  • Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. O’Reilly Media.

LLM Used