Assignement-Json-Html

Author

Guibril Ramde

Approach

For this assignment, I will create two different data representations of the same dataset: JSON and HTML. The dataset will consist of three books related to the fields of data science and computer science. Each book will include the following attributes: Title, Authors, Publisher, Year, and ISBN.

First, I will create a table in Microsoft Excel containing the selected books and their corresponding attributes. Once the table is completed, it will be exported as a CSV file. The CSV file will then be uploaded to a GitHub repository, allowing the dataset to be accessed using a publicly available raw file URL.

Using R, I will load the CSV data directly from the GitHub raw link and convert the dataset into two formats: JSON and HTML. These files will then be committed and uploaded to the same GitHub repository so they can also be accessed publicly through their raw URLs.

Next, I will write R code to load both the JSON file and the HTML file into separate data frames. Finally, I will compare the two data frames using the identical() function in order to determine whether both data sources contain exactly the same information.

This workflow ensures that the dataset is reproducible and publicly accessible, and that it can be validated by comparing the two data formats.

Coding section

Loading the libraries and reading the CSV file from its raw GitHub link

library(tidyverse)

Warning: package 'ggplot2' was built under R version 4.5.2

Warning: package 'tibble' was built under R version 4.5.2

Warning: package 'tidyr' was built under R version 4.5.2

Warning: package 'readr' was built under R version 4.5.2

Warning: package 'purrr' was built under R version 4.5.2

Warning: package 'dplyr' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(readr)
library(jsonlite)


Attaching package: 'jsonlite'

The following object is masked from 'package:purrr':

    flatten

library(knitr)

Warning: package 'knitr' was built under R version 4.5.2

library(rvest)


Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding

get_data <- "https://raw.githubusercontent.com/japhet125/Assignement-Json-Html/refs/heads/main/Assignementjson%26html%20-%20Sheet1.csv"

books <- read_csv(get_data)

Rows: 3 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Tittle, Authors, Publisher
dbl (2): Year, ISBN

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

names(books)[names(books) == "Tittle"] <- "Title"
colnames(books)

[1] "Title"     "Authors"   "Publisher" "Year"      "ISBN"
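The column specification above shows that read_csv() guessed ISBN as a double, which is why it is converted with as.character() later. As a minimal sketch, assuming the readr `col_types` argument is acceptable here, the column can instead be read as text from the start. The inline CSV and the name `books_demo` are illustrative stand-ins for the GitHub file, holding two of the three books.

```r
library(readr)

# Illustrative inline CSV (two of the three books); the real data comes from GitHub.
csv_lines <- c(
  "Title,Authors,Publisher,Year,ISBN",
  "Python for Data Analysis,Wes McKinney,O'Reilly,2022,9781098104030",
  "Introduction to Statistical Learning,\"Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani\",Springer,2021,9781071614174"
)

# I() tells readr to treat the string as literal data rather than a file path.
# Only ISBN is pinned to character; the other columns are still guessed.
books_demo <- read_csv(
  I(paste(csv_lines, collapse = "\n")),
  col_types = cols(ISBN = col_character())
)

class(books_demo$ISBN)
```

Reading identifiers such as ISBNs as text avoids any reliance on how large numbers are printed or converted later on.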

Dataset Description

The dataset used in this assignment was manually created in Microsoft Excel. Three books related to data science and computer science were identified through online research. After collecting the relevant attributes for each book (Title, Authors, Publisher, Year, and ISBN), the information was organized into a structured Excel table. The table was then exported as a CSV file, which serves as the primary data source for this analysis.

Writing the JSON file using the write_json() function

books$ISBN <- as.character(books$ISBN)
write_json(books, "books.json", pretty = TRUE)

readLines("books.json")

 [1] "["                                                                                        
 [2] "  {"                                                                                      
 [3] "    \"Title\": \"Python for Data Analysis\","                                             
 [4] "    \"Authors\": \"Wes McKinney\","                                                       
 [5] "    \"Publisher\": \"O'Reilly\","                                                         
 [6] "    \"Year\": 2022,"                                                                      
 [7] "    \"ISBN\": \"9781098104030\""                                                          
 [8] "  },"                                                                                     
 [9] "  {"                                                                                      
[10] "    \"Title\": \"Introduction to Statistical Learning\","                                 
[11] "    \"Authors\": \"Gareth James, Daniela Witten,   \\nTrevor Hastie, Robert Tibshirani\","
[12] "    \"Publisher\": \"Springer\","                                                         
[13] "    \"Year\": 2021,"                                                                      
[14] "    \"ISBN\": \"9781071614174\""                                                          
[15] "  },"                                                                                     
[16] "  {"                                                                                      
[17] "    \"Title\": \"Hands-On Machine Learning with\\n Scikit-Learn, Keras & TensorFlow\","   
[18] "    \"Authors\": \"Aurélien Géron\","                                                     
[19] "    \"Publisher\": \"O'Reilly\","                                                         
[20] "    \"Year\": 2023,"                                                                      
[21] "    \"ISBN\": \"9781098125974\""                                                          
[22] "  }"                                                                                      
[23] "]"
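A JSON round trip does not necessarily preserve R column types, which matters for the comparison later in this report. The sketch below writes a small illustrative data frame (`books_demo`, a hypothetical name, with two of the three books) to a temporary file and reads it back, so the types before and after can be inspected.

```r
library(jsonlite)

# Illustrative subset of the dataset; Year is a double, as read_csv() produces.
books_demo <- data.frame(
  Title = c("Python for Data Analysis", "Introduction to Statistical Learning"),
  Year  = c(2022, 2021),
  ISBN  = c("9781098104030", "9781071614174")
)

json_path <- tempfile(fileext = ".json")   # temporary local path, not the GitHub file
write_json(books_demo, json_path, pretty = TRUE)
back <- fromJSON(json_path)

# Compare the column types before and after the round trip.
class(books_demo$Year)
class(back$Year)
class(back$ISBN)
```

Whole numbers written without a decimal point come back from fromJSON() as integers, so a strict identical() check against the original data frame can fail even when the values match.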

Creating the HTML table using kable()

kable(books, format = "html")

Title	Authors	Publisher	Year	ISBN
Python for Data Analysis	Wes McKinney	O'Reilly	2022	9781098104030
Introduction to Statistical Learning	Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani	Springer	2021	9781071614174
Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow	Aurélien Géron	O'Reilly	2023	9781098125974
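The call above only renders the table; the report does not show how books.html was saved before being uploaded. One possible way, sketched here under that assumption, is to wrap the kable() output in a small HTML page and write it with writeLines(). The `<meta charset>` line, the one-row `books_demo` data frame, and the temporary path are all illustrative choices, not the author's confirmed method.

```r
library(knitr)

# Illustrative one-row data frame standing in for the full books table.
books_demo <- data.frame(
  Title   = "Python for Data Analysis",
  Authors = "Wes McKinney",
  Year    = 2022L
)

# Wrap the table in a minimal page; declaring the charset is an assumption
# intended to help HTML parsers decode accented characters correctly.
html_lines <- c(
  "<!DOCTYPE html>",
  "<html>",
  "<head><meta charset=\"utf-8\"></head>",
  "<body>",
  as.character(kable(books_demo, format = "html")),
  "</body>",
  "</html>"
)

html_path <- tempfile(fileext = ".html")   # temporary local path
writeLines(html_lines, html_path, useBytes = TRUE)
```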

Load JSON

After creating and uploading the JSON file to GitHub, the raw GitHub link is used to load the file back into R. The fromJSON() function reads the data and stores it in a separate data frame named books_json. The ISBN column is again converted to character format to ensure consistency during comparison.

json_url <- "https://raw.githubusercontent.com/japhet125/Assignement-Json-Html/refs/heads/main/books.json"
books_json <- fromJSON(json_url)
books_json$ISBN <- as.character(books_json$ISBN)
books_json

                                                              Title
1                                          Python for Data Analysis
2                              Introduction to Statistical Learning
3 Hands-On Machine Learning with\n Scikit-Learn, Keras & TensorFlow
                                                             Authors Publisher
1                                                       Wes McKinney  O'Reilly
2 Gareth James, Daniela Witten,   \nTrevor Hastie, Robert Tibshirani  Springer
3                                                     Aurélien Géron  O'Reilly
  Year          ISBN
1 2022 9781098104030
2 2021 9781071614174
3 2023 9781098125974

Load HTML

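The HTML version of the dataset is loaded into R using the raw GitHub link. The read_html() function from the rvest package reads the HTML file, and html_table() extracts its tables. Note that html_table() returns a list of tibbles, one per table, rather than a single data frame. As a result, the ISBN assignment in the code below adds a new, empty list element named ISBN instead of converting the table's column, as the printed output shows.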

html_url <- "https://raw.githubusercontent.com/japhet125/Assignement-Json-Html/refs/heads/main/books.html"
books_html <- read_html(html_url) |>
  html_table()

books_html$ISBN <- as.character(books_html$ISBN)
books_html

[[1]]
# A tibble: 3 × 5
  Title                                          Authors Publisher  Year    ISBN
  <chr>                                          <chr>   <chr>     <int>   <dbl>
1 "Python for Data Analysis"                     "Wes M… O'Reilly   2022 9.78e12
2 "Introduction to Statistical Learning"         "Garet… Springer   2021 9.78e12
3 "Hands-On Machine Learning with\n Scikit-Lear… "AurÃ©… O'Reilly   2023 9.78e12

$ISBN
character(0)
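The output above reflects how html_table() behaves when it is given a whole document: it returns a list with one tibble per table. A minimal sketch of this, using rvest's minimal_html() helper and a made-up one-row table, shows two ways to get at the table itself.

```r
library(rvest)

# Small illustrative page with a single table.
page <- minimal_html("
  <table>
    <tr><th>Title</th><th>ISBN</th></tr>
    <tr><td>Python for Data Analysis</td><td>9781098104030</td></tr>
  </table>")

tables <- html_table(page)   # a list holding one tibble per <table>
class(tables)

# Option 1: pick the first (and only) table out of the list.
books_tbl <- tables[[1]]

# Option 2: select a single table node first, then extract it.
books_tbl2 <- page |>
  html_element("table") |>
  html_table()
```

Either option yields a tibble whose columns can then be converted, for example with `books_tbl$ISBN <- as.character(books_tbl$ISBN)`.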

Compare the two data frames

To determine whether the JSON and HTML files contain the same information, the two data frames are compared using the identical() function. This function checks whether the objects are exactly the same in both values and structure.

identical(books_json, books_html)

[1] FALSE

all_equal(books_json, books_html)

Warning: `all_equal()` was deprecated in dplyr 1.1.0.
ℹ Please use `all.equal()` instead.
ℹ And manually order the rows/cols as needed

`y` must be a data frame.

Running all_equal() explains why the comparison fails: the message "`y` must be a data frame" means that books_html is not a data frame. As the earlier output shows, it is a list that holds the table as its first element.

Using an if/else condition with all.equal() to print a message

if (isTRUE(all.equal(books_json, books_html))){
  print("The two data frames are equivalent.")
} else {
  print("The two data frames are not exactly equivalent.")
}

[1] "The two data frames are not exactly equivalent."
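The isTRUE() wrapper in the condition above is needed because all.equal() does not return FALSE when objects differ; it returns a character vector describing the differences. A small base R sketch with two made-up data frames (`a` and `b` are illustrative names) shows the behavior.

```r
# Two illustrative data frames that differ in one value of column y.
a <- data.frame(x = 1:3, y = c("a", "b", "c"))
b <- data.frame(x = 1:3, y = c("a", "b", "z"))

res <- all.equal(a, b)

is.character(res)         # TRUE: a description of the differences, not FALSE
isTRUE(res)               # FALSE
isTRUE(all.equal(a, a))   # TRUE
```

Using `if (all.equal(a, b))` directly would raise an error here, because a character vector is not a valid condition.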

Making both data frames identical

books_html <- read_html(html_url) %>%
  html_table() %>%
  .[[1]]
# Make sure ISBN is character in both the JSON and HTML data frames
books_html$ISBN <- as.character(books_html$ISBN)
books_json$ISBN <- as.character(books_json$ISBN)
# Make sure Year is character in both the JSON and HTML data frames
books_html$Year <- as.character(books_html$Year)
books_json$Year <- as.character(books_json$Year)
books_html <- books_html[, names(books_json)]

identical(books_json, books_html)

[1] FALSE

str(books_json)

'data.frame':   3 obs. of  5 variables:
 $ Title    : chr  "Python for Data Analysis" "Introduction to Statistical Learning" "Hands-On Machine Learning with\n Scikit-Learn, Keras & TensorFlow"
 $ Authors  : chr  "Wes McKinney" "Gareth James, Daniela Witten,   \nTrevor Hastie, Robert Tibshirani" "Aurélien Géron"
 $ Publisher: chr  "O'Reilly" "Springer" "O'Reilly"
 $ Year     : chr  "2022" "2021" "2023"
 $ ISBN     : chr  "9781098104030" "9781071614174" "9781098125974"

str(books_html)

tibble [3 × 5] (S3: tbl_df/tbl/data.frame)
 $ Title    : chr [1:3] "Python for Data Analysis" "Introduction to Statistical Learning" "Hands-On Machine Learning with\n Scikit-Learn, Keras & TensorFlow"
 $ Authors  : chr [1:3] "Wes McKinney" "Gareth James, Daniela Witten,   \nTrevor Hastie, Robert Tibshirani" "AurÃ©lien GÃ©ron"
 $ Publisher: chr [1:3] "O'Reilly" "Springer" "O'Reilly"
 $ Year     : chr [1:3] "2022" "2021" "2023"
 $ ISBN     : chr [1:3] "9781098104030" "9781071614174" "9781098125974"
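The two str() listings above can also be compared programmatically rather than by eye. The sketch below checks each column separately and then locates the differing row; `json_like` and `html_like` are illustrative stand-ins for books_json and books_html, reduced to two rows and two columns, with the accented characters written as Unicode escapes.

```r
# Illustrative stand-ins for the two data frames.
json_like <- data.frame(
  Authors = c("Wes McKinney", "Aur\u00e9lien G\u00e9ron"),
  Year    = c("2022", "2023")
)
html_like <- data.frame(
  Authors = c("Wes McKinney", "Aur\u00c3\u00a9lien G\u00c3\u00a9ron"),  # mis-decoded name
  Year    = c("2022", "2023")
)

# One TRUE/FALSE per column: is the column identical in both data frames?
same_col <- vapply(
  names(json_like),
  function(col) identical(json_like[[col]], html_like[[col]]),
  logical(1)
)
same_col

# Row index of the mismatch within the Authors column.
which(json_like$Authors != html_like$Authors)
```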

all_equal(books_json, books_html)

Warning: `all_equal()` was deprecated in dplyr 1.1.0.
ℹ Please use `all.equal()` instead.
ℹ And manually order the rows/cols as needed

[1] "- Rows in x but not in y: 3\n- Rows in y but not in x: 3\n"

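This output flags row 3 in each data frame. Comparing the two str() listings shows why: the Authors value in that row is "Aurélien Géron" in the JSON data but the mis-decoded "AurÃ©lien GÃ©ron" in the HTML data. This encoding problem is what keeps the two data frames from being identical.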

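To correct this properly, the encoding would need to be handled when the HTML table is read, for example by declaring UTF-8 in read_html(). For the purposes of this assignment, the affected value is simply replaced directly: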

books_html$Authors <- gsub("AurÃ©lien GÃ©ron", "Aurélien Géron", books_html$Authors)
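As an alternative to patching the string afterwards, the encoding can be declared at read time. This is a minimal sketch, assuming the file on disk is UTF-8 and using the `encoding` argument of read_html(); it works on a temporary local file with a made-up one-cell table rather than the GitHub URL, and the accented name is written with Unicode escapes.

```r
library(rvest)

# Illustrative UTF-8 HTML file with no charset declaration.
html_text <- paste0(
  "<table><tr><th>Authors</th></tr>",
  "<tr><td>Aur\u00e9lien G\u00e9ron</td></tr></table>"
)
html_path <- tempfile(fileext = ".html")
writeLines(html_text, html_path, useBytes = TRUE)   # write the UTF-8 bytes as-is

# Tell the parser explicitly how the file is encoded.
authors <- read_html(html_path, encoding = "UTF-8") |>
  html_element("table") |>
  html_table()

authors$Authors
```

With the encoding stated explicitly, the parser does not have to guess, so the accented characters should arrive intact and no gsub() repair is needed.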

books_json <- as.data.frame(books_json)
books_html <- as.data.frame(books_html)
all.equal(books_json, books_html)

[1] TRUE

identical(books_json, books_html)

[1] TRUE

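After converting both books_json and books_html to plain data frames with as.data.frame(), both all.equal() and identical() return TRUE. The conversion matters because html_table() returns a tibble, whose class attribute differs from that of the data.frame returned by fromJSON().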
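The effect of the class attribute can be seen in isolation. In this small sketch, `df` and `tbl` are illustrative objects holding the same values, one as a base data frame and one as a tibble.

```r
library(tibble)

df  <- data.frame(Year = c("2022", "2021"))
tbl <- tibble(Year = c("2022", "2021"))

class(tbl)                          # "tbl_df" "tbl" "data.frame"
identical(df, tbl)                  # FALSE: same values, different class attribute
identical(df, as.data.frame(tbl))   # TRUE once the tibble classes are dropped
```

identical() compares attributes as well as values, so two objects with the same contents but different classes are never identical.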

Conclusion

In this assignment, a dataset containing three books in the fields of data science and computer science was created and analyzed using R. The dataset was initially constructed in Microsoft Excel and exported as a CSV file, which was then uploaded to a GitHub repository to ensure public accessibility and reproducibility.

Using R, the dataset was converted into two different formats: JSON and HTML. Both files were subsequently uploaded to GitHub and accessed through their raw URLs. The JSON file was loaded into R using the fromJSON() function, while the HTML file was parsed using the read_html() and html_table() functions from the rvest package.

During the comparison process, several differences were initially identified between the two data frames. These differences were primarily due to structural and formatting inconsistencies, including the HTML table being returned as a list, variations in column data types, and a character encoding issue affecting the author’s name containing accented characters. After standardizing the data structures, converting column types to consistent formats, and correcting the encoding issue, both datasets were successfully aligned.

The final comparison using the identical() function confirmed that the two data frames were exactly identical, demonstrating that the JSON and HTML representations contained the same information. Overall, this assignment provided valuable experience in data transformation, data format interoperability, web data access, and data validation using R. It also highlighted the importance of addressing issues such as character encoding, data structure consistency, and formatting differences when working with multiple data formats.