Assignment JSON/HTML

Author

Michael Mayne

Assignment #6

Code Approach

This assignment is a simple project that requires writing the same set of data in two text formats: a JSON file and an HTML file. For my example I plan to use the information below.

  • Book: The Subtle Art of Not Giving a F—

  • Author: Mark Manson

  • Genre: Self-Help

  • Publish Year: 2016

  • Page Count: 272

The main goal is to create both files manually using the Notepad application on my computer, typing the information in each format's respective syntax. Saving a text file with the extension .json or .html changes the format of the file. JSON can be loaded using read_json() from the jsonlite package. HTML can be loaded using read_html() after installing the xml2 package. After both are loaded, we can see how well the 5 pieces of information above carry over in each format.
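As a minimal sketch of the approach described above, the snippet below writes a tiny one-book sample to temporary .json and .html files and reads each one back. The sample content and the variable names (json_path, html_path, sample_json, sample_html) are assumptions made for illustration; they are not the assignment's actual files.

```r
library(jsonlite)
library(xml2)

# Hypothetical one-book sample written to temporary files (not the real assignment files)
json_path <- tempfile(fileext = ".json")
html_path <- tempfile(fileext = ".html")

writeLines('{"title": ["The Color Purple"], "pages": [251]}', json_path)
writeLines(
  "<table><tr><th>title</th><th>pages</th></tr><tr><td>The Color Purple</td><td>251</td></tr></table>",
  html_path
)

# read_json() is provided by jsonlite; simplifyVector = TRUE turns JSON arrays into R vectors
sample_json <- read_json(json_path, simplifyVector = TRUE)

# read_html() is provided by xml2 and returns a parsed document, not a data frame
sample_html <- read_html(html_path)

class(sample_json)
class(sample_html)
```

Neither reader returns a data frame directly, which is why a conversion step is still needed after loading.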

Code Base

Additional Book # 1

  • Title: The Color Purple

  • Author: Alice Walker

  • Genre: Domestic Fiction

  • Publish Year: 1982

  • Page Count: 251

Additional Book # 2

  • Title: His Name is Banksy

  • Author: Francesco Matteuzzi, Marco Maraggi

  • Genre: Graphic Novel

  • Publish Year: 2022

  • Page Count: 128

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Manual Creation of Dataset

Both sets of data were created using Notepad as the base text editor.

JSON: Created in Notepad using standard JSON notation, with each detail stored as a key whose value is an array holding one entry per book. Most values were treated as strings, except for pages and publish year, which are numbers.

{
  "title" : [" The Subtle Art of Not Giving a F---", "The Color Purple", "His Name is Banksy"],
  "author" : ["Mark Manson", "Alice Walker", "Francesco Matteuzzi & Marco Maraggi"],
  "genre" : ["Self-Help", "Domestic Fiction", "Graphic Novel"],
  "year" : [2016, 1982, 2022],
  "pages" : [272, 251, 128]
}
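The file above stores one array per field (a column-oriented layout). As a hedged illustration, the sketch below uses a hypothetical two-book data frame and jsonlite's toJSON() to show that layout next to the row-oriented alternative, where each book is its own object. The data frame and the variable names are assumptions for the example only.

```r
library(jsonlite)

# Hypothetical two-book sample used only for illustration
books <- data.frame(
  title = c("The Color Purple", "His Name is Banksy"),
  year = c(1982, 2022),
  stringsAsFactors = FALSE
)

# Column-oriented: one array per field, the same layout as the file above
cols_json <- toJSON(books, dataframe = "columns", pretty = TRUE)

# Row-oriented: one object per book
rows_json <- toJSON(books, dataframe = "rows", pretty = TRUE)

cat(cols_json, "\n")
cat(rows_json, "\n")

# The column layout parses to a list, so it still needs as.data.frame();
# the row layout is simplified to a data frame by fromJSON() itself
from_cols <- as.data.frame(fromJSON(cols_json))
from_rows <- fromJSON(rows_json)
```

Either layout carries the same information; the column layout is shorter to type by hand, while the row layout keeps each book's details together.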

HTML: The HTML version was significantly more involved. The format is much more verbose, so every row and every cell had to be marked up individually with its own tags. The resulting file is considerably longer than the JSON version.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>books</title>
</head>
<body>
    <table>
    <tr>
      <th>title</th>
      <th>author</th>
      <th>genre</th>
      <th>year</th>
      <th>pages</th>
    </tr>
    <tr>
      <td>The Subtle Art of Not Giving a F---</td>
      <td>Mark Manson</td>
      <td>Self-Help</td>
      <td>2016</td>
      <td>272</td>
    </tr>
    <tr>
      <td>The Color Purple</td>
      <td>Alice Walker</td>
      <td>Domestic Fiction</td>
      <td>1982</td>
      <td>251</td>
    </tr>
    <tr>
      <td>His Name is Banksy</td>
      <td>Francesco Matteuzzi & Marco Maraggi</td>
      <td>Graphic Novel</td>
      <td>2022</td>
      <td>128</td>
    </tr>
    </table>
</body>
</html>
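The table above was typed by hand. Purely as an illustration of how repetitive the row markup is, the sketch below builds the same kind of table from a data frame in base R. The helper make_row() and the two-book sample are hypothetical, and the sketch does not escape special characters such as &, so it is not a general-purpose HTML writer.

```r
# Hypothetical helper: wrap each value in a cell tag and join the cells into one row
make_row <- function(values, cell = "td") {
  cells <- paste0("<", cell, ">", values, "</", cell, ">", collapse = "")
  paste0("<tr>", cells, "</tr>")
}

# Hypothetical two-book sample used only for illustration
books <- data.frame(
  title = c("The Color Purple", "His Name is Banksy"),
  pages = c(251, 128),
  stringsAsFactors = FALSE
)

# Header row uses <th>, data rows use <td>
header <- make_row(names(books), cell = "th")
body <- vapply(
  seq_len(nrow(books)),
  function(i) make_row(vapply(books[i, ], as.character, character(1))),
  character(1)
)

table_html <- paste0("<table>", header, paste0(body, collapse = ""), "</table>")
cat(table_html, "\n")
```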

Loading Both Files

For this assignment we will load each file using a similar naming convention and then look at the differences between the two.

Loading JSON

For simplicity, jsonlite was the recommended package for opening and viewing the data set.

library(jsonlite)

Attaching package: 'jsonlite'
The following object is masked from 'package:purrr':

    flatten
raw_JSON <- fromJSON("https://raw.githubusercontent.com/Mayneman000/DATA607Assignment/refs/heads/main/books.json")

class(raw_JSON)
[1] "list"

The data set loaded as a list, so I will make a small adjustment to convert it to a data frame.

books_JSON <- as.data.frame(raw_JSON)
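To make the list-to-data-frame step easier to follow, here is a small self-contained sketch that parses an inline JSON string instead of the remote file. The inline two-book sample and the names json_text, raw_list, books_df, and col_types are assumptions for the example. Note that as.data.frame() only works this way because every array has the same length.

```r
library(jsonlite)

# Hypothetical inline sample in the same column-oriented layout as books.json
json_text <- '{
  "title": ["The Color Purple", "His Name is Banksy"],
  "year": [1982, 2022],
  "pages": [251, 128]
}'

# fromJSON() returns a named list with one vector per key
raw_list <- fromJSON(json_text)
class(raw_list)

# Each list element becomes a column of the data frame
books_df <- as.data.frame(raw_list)

# Check which type each column ended up with
col_types <- vapply(books_df, function(col) class(col)[1], character(1))
print(col_types)
dim(books_df)
```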

Loading HTML

Loading HTML requires a different package, so rvest is used as the main package to read this data.

library(rvest)

Attaching package: 'rvest'
The following object is masked from 'package:readr':

    guess_encoding
# Creating Data Frame

raw_HTML <- read_html("https://raw.githubusercontent.com/Mayneman000/DATA607Assignment/refs/heads/main/books.html")

print(raw_HTML)
{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\r\n    <table>\n<tr>\n<th>title</th>\r\n\t<th>author</th>\r\n\t<th ...

The HTML also loads as something other than a data frame (a parsed document with a head node and a body node), so the table needs to be extracted and converted into a data frame.

books_HTML <- raw_HTML %>% 
  html_node("table") %>% 
  html_table()
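The same extraction can be tried without a network connection. The sketch below parses a hypothetical inline two-book table (html_text, page, and books_tbl are illustrative names). It uses html_element(), the name rvest 1.0 and later give to html_node(); the older name used above still works.

```r
library(rvest)

# Hypothetical inline table in the same shape as books.html
html_text <- "<html><body><table>
  <tr><th>title</th><th>year</th></tr>
  <tr><td>The Color Purple</td><td>1982</td></tr>
  <tr><td>His Name is Banksy</td><td>2022</td></tr>
</table></body></html>"

# read_html() accepts a string of markup as well as a URL or file path
page <- read_html(html_text)

# Select the first <table> and convert it; the <th> row becomes the column names
books_tbl <- page %>%
  html_element("table") %>%
  html_table()

print(books_tbl)
```

html_table() returns a tibble and, by default, converts columns that look numeric, so year does not stay a character column.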

Differences in Methods

To begin, once converted into proper data frames, the HTML and JSON versions hold the same five columns and three rows. The one visible difference is the stray leading space in the first title of the JSON file. Much of this similarity comes from the rvest and jsonlite packages, which automatically handle the raw files.
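One way to check this kind of claim is to compare the two data frames directly. The sketch below is a base R illustration on hypothetical stand-in data frames (json_df and html_df), not the objects loaded above. It assumes the only difference is a leading space in one title, as in the JSON file shown earlier.

```r
# Hypothetical stand-ins for the two loaded data frames
json_df <- data.frame(
  title = c(" The Subtle Art of Not Giving a F---", "The Color Purple"),
  pages = c(272L, 251L),
  stringsAsFactors = FALSE
)
html_df <- data.frame(
  title = c("The Subtle Art of Not Giving a F---", "The Color Purple"),
  pages = c(272L, 251L),
  stringsAsFactors = FALSE
)

# Strict comparison before cleaning; all.equal() describes what differs
same_before <- identical(json_df, html_df)
print(all.equal(json_df, html_df))

# Trim stray whitespace, then compare again
json_df$title <- trimws(json_df$title)
same_after <- identical(json_df, html_df)

c(before = same_before, after = same_after)
```

With the real objects, the HTML result is a tibble while the JSON result is a plain data frame, so a class conversion such as as.data.frame() would likely be needed before identical() could return TRUE.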

When we compare the raw files, we can see a lot of differences…

print(raw_HTML)
{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\r\n    <table>\n<tr>\n<th>title</th>\r\n\t<th>author</th>\r\n\t<th ...
print(raw_JSON)
$title
[1] " The Subtle Art of Not Giving a F---"
[2] "The Color Purple"                    
[3] "His Name is Banksy"                  

$author
[1] "Mark Manson"                         "Alice Walker"                       
[3] "Francesco Matteuzzi & Marco Maraggi"

$genre
[1] "Self-Help"        "Domestic Fiction" "Graphic Novel"   

$year
[1] 2016 1982 2022

$pages
[1] 272 251 128

JSON tends to be the simpler format: without conversion it loads as a list of 5 elements, essentially the original file with no changes. HTML, by contrast, splits into 2 long nodes, the first (head) holding the meta information needed to create the document/webpage and the second (body) holding the table. As a trade-off, an HTML file can be opened directly in a browser to show the data, whereas JSON needs another program before the information can be viewed in a readable form.

Although HTML is tailored to webpages, direct posting, and web scraping, JSON is cleaner and handles large amounts of information well. In my experience the JSON file was written and cleaned much faster than the HTML file, which took a lot more time, and the HTML method took longer overall to pick up.

End of Report.