For this assignment, I will create two different data representations of the same dataset: JSON and HTML. The dataset will consist of three books related to the fields of data science and computer science. Each book will include the following attributes: Title, Authors, Publisher, Year, and ISBN.
First, I will create a table in Microsoft Excel containing the selected books and their corresponding attributes. Once the table is completed, it will be exported as a CSV file. The CSV file will then be uploaded to a GitHub repository, allowing the dataset to be accessed using a publicly available raw file URL.
Using R, I will load the CSV data directly from the GitHub raw link and convert the dataset into two formats: JSON and HTML. These files will then be committed and uploaded to the same GitHub repository so they can also be accessed publicly through their raw URLs.
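The conversion step described above can be sketched as follows. This is a minimal illustration, not necessarily the exact code used for the assignment: the one-row `books` data frame and the temporary file paths are hypothetical stand-ins for the CSV loaded from GitHub, and using `knitr::kable()` to produce the HTML table is an assumption.

```r
library(jsonlite)
library(knitr)

# Hypothetical stand-in for the data frame read from the CSV on GitHub
books <- data.frame(
  Title = "Python for Data Analysis",
  Authors = "Wes McKinney",
  Publisher = "O'Reilly",
  Year = 2022,
  ISBN = "9781098104030",
  stringsAsFactors = FALSE
)

# Temporary output paths (assumed; the assignment writes into the repository)
json_path <- tempfile(fileext = ".json")
html_path <- tempfile(fileext = ".html")

# Write the data frame as a JSON array of row objects
write_json(books, json_path, pretty = TRUE)

# Render the data frame as an HTML <table> and save it to disk
writeLines(as.character(kable(books, format = "html")), html_path)
```

Storing ISBN as character before export avoids any loss of formatting when the files are read back later.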
Next, I will write R code to load both the JSON file and the HTML file into separate data frames. Finally, I will compare the two data frames using the identical() function in order to determine whether both data sources contain exactly the same information.
This workflow ensures that the dataset is reproducible, publicly accessible, and can be validated by comparing the two different data formats.
Coding section
Loading the libraries and reading the CSV file from its raw GitHub link
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'dplyr' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)
library(jsonlite)
Attaching package: 'jsonlite'
The following object is masked from 'package:purrr':
flatten
library(knitr)
Warning: package 'knitr' was built under R version 4.5.2
library(rvest)
Attaching package: 'rvest'
The following object is masked from 'package:readr':
guess_encoding
Rows: 3 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Tittle, Authors, Publisher
dbl (2): Year, ISBN
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The dataset used in this assignment was manually created in Microsoft Excel. Three books related to data science and computer science were identified through online research. After collecting the relevant attributes for each book (Title, Authors, Publisher, Year, and ISBN), the information was organized into a structured Excel table. The table was then exported as a CSV file, which serves as the primary data source for this analysis.
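Title | Authors | Publisher | Year | ISBN
Python for Data Analysis | Wes McKinney | O'Reilly | 2022 | 9781098104030
Introduction to Statistical Learning | Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani | Springer | 2021 | 9781071614174
Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow | Aurélien Géron | O'Reilly | 2023 | 9781098125974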
Load JSON
After creating and uploading the JSON file to GitHub, the raw GitHub link is used to load the file back into R. The fromJSON() function reads the data and stores it in a separate data frame named books_json. The ISBN column is again converted to character format to ensure consistency during comparison.
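As a rough sketch of this step, the example below parses a small JSON string instead of the raw GitHub URL, so it runs without network access. The `json_source` text is a hypothetical one-row stand-in for the real file; `fromJSON()` accepts a URL in the same position.

```r
library(jsonlite)

# Hypothetical JSON text standing in for the raw GitHub file
json_source <- '[
  {"Title": "Python for Data Analysis", "Year": 2022, "ISBN": 9781098104030}
]'

# An array of objects is simplified into a data frame, one row per object
books_json <- fromJSON(json_source)

# ISBN arrives as a number here, so convert it to character for comparison
books_json$ISBN <- as.character(books_json$ISBN)

str(books_json)
```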
Title
1 Python for Data Analysis
2 Introduction to Statistical Learning
3 Hands-On Machine Learning with\n Scikit-Learn, Keras & TensorFlow
Authors Publisher
1 Wes McKinney O'Reilly
2 Gareth James, Daniela Witten, \nTrevor Hastie, Robert Tibshirani Springer
3 Aurélien Géron O'Reilly
Year ISBN
1 2022 9781098104030
2 2021 9781071614174
3 2023 9781098125974
Load HTML
The HTML version of the dataset is loaded into R using the raw GitHub link. The read_html() function from the rvest package reads the HTML file, and html_table() extracts the table into a data frame. The ISBN column is then converted to character format so that it matches the JSON data frame.
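A minimal sketch of the HTML loading step is shown below. The `html_source` string is a hypothetical one-row table used in place of the raw GitHub URL; `read_html()` accepts a URL in the same position. Note that `html_table()` applied to a whole document returns a list of tables, so the first element has to be extracted.

```r
library(rvest)

# Hypothetical HTML text standing in for the raw GitHub file
html_source <- '<html><body><table>
<tr><th>Title</th><th>Year</th><th>ISBN</th></tr>
<tr><td>Python for Data Analysis</td><td>2022</td><td>9781098104030</td></tr>
</table></body></html>'

page <- read_html(html_source)

# html_table() on a document returns a list with one entry per <table>
tables <- html_table(page)

# Take the first (and only) table
books_html <- tables[[1]]

# Convert ISBN to character so it can match the JSON data frame
books_html$ISBN <- as.character(books_html$ISBN)
```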
To determine whether the JSON and HTML files contain the same information, the two data frames are compared using the identical() function. This function checks whether the objects are exactly the same in both values and structure.
identical(books_json, books_html)
[1] FALSE
all_equal(books_json, books_html)
Warning: `all_equal()` was deprecated in dplyr 1.1.0.
ℹ Please use `all.equal()` instead.
ℹ And manually order the rows/cols as needed
`y` must be a data frame.
Running the all_equal() function shows why the two objects are not equal. The message says that `y` must be a data frame, which means books_html is not a data frame at this point: html_table() returned a list of tables rather than a single table.
Using an if/else condition with all.equal() to print a message
if (isTRUE(all.equal(books_json, books_html))) {
  print("The two data frames are equivalent.")
} else {
  print("The two data frames are not exactly equivalent.")
}
[1] "The two data frames are not exactly equivalent."
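The reason for wrapping all.equal() in isTRUE() can be seen on a toy example. The two small data frames below are hypothetical and differ only in the type of the Year column; they are not the assignment data.

```r
# Two hypothetical data frames that differ only in the type of Year
a <- data.frame(Year = c("2022", "2021"), stringsAsFactors = FALSE)
b <- data.frame(Year = c(2022, 2021))

# identical() gives a strict yes/no answer
strict <- identical(a, b)

# all.equal() returns TRUE on a match, otherwise a character
# description of the differences, so it must be wrapped in isTRUE()
report <- all.equal(a, b)
equivalent <- isTRUE(report)

print(strict)
print(report)
```

Using `all.equal()` directly inside `if ()` would fail whenever it returns a description instead of TRUE, which is why `isTRUE()` is the safer pattern.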
Making both data frames identical
books_html <- read_html(html_url) %>%
  html_table() %>%
  .[[1]]

# Make sure ISBN is character in both the JSON and HTML data frames
books_html$ISBN <- as.character(books_html$ISBN)
books_json$ISBN <- as.character(books_json$ISBN)

# Make sure Year is character in both the JSON and HTML data frames
books_html$Year <- as.character(books_html$Year)
books_json$Year <- as.character(books_json$Year)

# Put the HTML columns in the same order as the JSON columns
books_html <- books_html[, names(books_json)]

identical(books_json, books_html)
[1] FALSE
str(books_json)
'data.frame': 3 obs. of 5 variables:
$ Title : chr "Python for Data Analysis" "Introduction to Statistical Learning" "Hands-On Machine Learning with\n Scikit-Learn, Keras & TensorFlow"
$ Authors : chr "Wes McKinney" "Gareth James, Daniela Witten, \nTrevor Hastie, Robert Tibshirani" "Aurélien Géron"
$ Publisher: chr "O'Reilly" "Springer" "O'Reilly"
$ Year : chr "2022" "2021" "2023"
$ ISBN : chr "9781098104030" "9781071614174" "9781098125974"
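The str() output shows that books_json is a plain data.frame with character columns. books_html, by contrast, is returned by html_table() as a tibble, so its class attribute differs and identical() still returns FALSE. After both books_json and books_html are converted to plain data.frame objects, the comparison returns TRUE.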
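One way to perform this kind of alignment is sketched below in base R. The `normalize()` helper and both toy data frames are hypothetical, and the tibble class is imitated by setting the class attribute by hand so the example needs no extra packages. Collapsing whitespace is an added assumption, motivated by the embedded "\n" visible in the JSON titles.

```r
# Hypothetical JSON-style data frame with an embedded newline in the title
json_like <- data.frame(
  Title = "Hands-On Machine Learning with\n Scikit-Learn",
  Year = "2023",
  stringsAsFactors = FALSE
)

# Hypothetical HTML-style copy: clean title, tibble-like class attribute
html_like <- data.frame(
  Title = "Hands-On Machine Learning with Scikit-Learn",
  Year = "2023",
  stringsAsFactors = FALSE
)
class(html_like) <- c("tbl_df", "tbl", "data.frame")

# Hypothetical helper: plain data.frame plus collapsed whitespace
normalize <- function(df) {
  df <- as.data.frame(df, stringsAsFactors = FALSE)
  df[] <- lapply(df, function(col) {
    if (is.character(col)) trimws(gsub("\\s+", " ", col)) else col
  })
  df
}

before <- identical(json_like, html_like)
after <- identical(normalize(json_like), normalize(html_like))
print(c(before = before, after = after))
```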
Conclusion
In this assignment, a dataset containing three books in the fields of data science and computer science was created and analyzed using R. The dataset was initially constructed in Microsoft Excel and exported as a CSV file, which was then uploaded to a GitHub repository to ensure public accessibility and reproducibility.
Using R, the dataset was converted into two different formats: JSON and HTML. Both files were subsequently uploaded to GitHub and accessed through their raw URLs. The JSON file was loaded into R using the fromJSON() function, while the HTML file was parsed using the read_html() and html_table() functions from the rvest package.
During the comparison process, several differences were initially identified between the two data frames. These differences were primarily due to structural and formatting inconsistencies, including the HTML table being returned as a list, variations in column data types, and a character encoding issue affecting the author’s name containing accented characters. After standardizing the data structures, converting column types to consistent formats, and correcting the encoding issue, both datasets were successfully aligned.
The final comparison using the identical() function confirmed that the two data frames were exactly identical, demonstrating that the JSON and HTML representations contained the same information. Overall, this assignment provided valuable experience in data transformation, data format interoperability, web data access, and data validation using R. It also highlighted the importance of addressing issues such as character encoding, data structure consistency, and formatting differences when working with multiple data formats.