Overview / Introduction

For this assignment, we have received data from a fictional store named “CUNYMart” and have been tasked with preparing the data for analysis by formatting it in JSON, HTML, XML, and Parquet. Additionally, we must provide the pros and cons of each format.

Load Packages

As always, let’s start off by loading the necessary packages.

library(tidyverse)
library(dplyr)
library(jsonlite)
library(kableExtra)
library(xml2)
library(arrow)

Import Data

Now let’s load the data from a .CSV file that’s been created and stored on GitHub.

url <- url('https://raw.githubusercontent.com/Stevee-G/Data607/refs/heads/main/CUNYMart_data.csv')
data <- read_csv(url)
glimpse(data)
## Rows: 20
## Columns: 7
## $ Category            <chr> "Electronics", "Electronics", "Electronics", "Elec…
## $ `Item Name`         <chr> "Smartphone", "Smartphone", "Laptop", "Laptop", "R…
## $ `Item ID`           <dbl> 101, 101, 102, 102, 201, 201, 202, 202, 301, 301, …
## $ Brand               <chr> "TechBrand", "TechBrand", "CompuBrand", "CompuBran…
## $ Price               <dbl> 699.99, 699.99, 1099.99, 1099.99, 899.99, 899.99, …
## $ `Variation ID`      <chr> "101-A", "101-B", "102-A", "102-B", "201-A", "201-…
## $ `Variation Details` <chr> "Color: Black, Storage: 64GB", "Color: White, Stor…

JSON

First, we will format the data into JSON using the toJSON() function from the jsonlite package. Right off the bat we can see that JSON is stored differently from a .CSV data frame, which makes sense since the JSON file type is tailored towards online or the webpage experience. Below are some apparent pros and cons of JSON.

Pros: 1. Easy to learn and simple to read and understand. 2. Text-only format, making it easy to send across servers. 3. Faster and easier to parse than XML.

Cons: 1. Not suitable for complex data structures. 2. Not as useful for aggregate queries compared to other formats like XML

json_data <- toJSON(data, pretty = TRUE)
json_data
## [
##   {
##     "Category": "Electronics",
##     "Item Name": "Smartphone",
##     "Item ID": 101,
##     "Brand": "TechBrand",
##     "Price": 699.99,
##     "Variation ID": "101-A",
##     "Variation Details": "Color: Black, Storage: 64GB"
##   },
##   {
##     "Category": "Electronics",
##     "Item Name": "Smartphone",
##     "Item ID": 101,
##     "Brand": "TechBrand",
##     "Price": 699.99,
##     "Variation ID": "101-B",
##     "Variation Details": "Color: White, Storage: 128GB"
##   },
##   {
##     "Category": "Electronics",
##     "Item Name": "Laptop",
##     "Item ID": 102,
##     "Brand": "CompuBrand",
##     "Price": 1099.99,
##     "Variation ID": "102-A",
##     "Variation Details": "Color: Silver, Storage: 256GB"
##   },
##   {
##     "Category": "Electronics",
##     "Item Name": "Laptop",
##     "Item ID": 102,
##     "Brand": "CompuBrand",
##     "Price": 1099.99,
##     "Variation ID": "102-B",
##     "Variation Details": "Color: Space Gray, Storage: 512GB"
##   },
##   {
##     "Category": "Home Appliances",
##     "Item Name": "Refrigerator",
##     "Item ID": 201,
##     "Brand": "HomeCool",
##     "Price": 899.99,
##     "Variation ID": "201-A",
##     "Variation Details": "Color: Stainless Steel, Capacity: 20 cu ft"
##   },
##   {
##     "Category": "Home Appliances",
##     "Item Name": "Refrigerator",
##     "Item ID": 201,
##     "Brand": "HomeCool",
##     "Price": 899.99,
##     "Variation ID": "201-B",
##     "Variation Details": "Color: White, Capacity: 18 cu ft"
##   },
##   {
##     "Category": "Home Appliances",
##     "Item Name": "Washing Machine",
##     "Item ID": 202,
##     "Brand": "CleanTech",
##     "Price": 499.99,
##     "Variation ID": "202-A",
##     "Variation Details": "Type: Front Load, Capacity: 4.5 cu ft"
##   },
##   {
##     "Category": "Home Appliances",
##     "Item Name": "Washing Machine",
##     "Item ID": 202,
##     "Brand": "CleanTech",
##     "Price": 499.99,
##     "Variation ID": "202-B",
##     "Variation Details": "Type: Top Load, Capacity: 5.0 cu ft"
##   },
##   {
##     "Category": "Clothing",
##     "Item Name": "T-Shirt",
##     "Item ID": 301,
##     "Brand": "FashionCo",
##     "Price": 19.99,
##     "Variation ID": "301-A",
##     "Variation Details": "Color: Blue, Size: S"
##   },
##   {
##     "Category": "Clothing",
##     "Item Name": "T-Shirt",
##     "Item ID": 301,
##     "Brand": "FashionCo",
##     "Price": 19.99,
##     "Variation ID": "301-B",
##     "Variation Details": "Color: Red, Size: M"
##   },
##   {
##     "Category": "Clothing",
##     "Item Name": "T-Shirt",
##     "Item ID": 301,
##     "Brand": "FashionCo",
##     "Price": 19.99,
##     "Variation ID": "301-C",
##     "Variation Details": "Color: Green, Size: L"
##   },
##   {
##     "Category": "Clothing",
##     "Item Name": "Jeans",
##     "Item ID": 302,
##     "Brand": "DenimWorks",
##     "Price": 49.99,
##     "Variation ID": "302-A",
##     "Variation Details": "Color: Dark Blue, Size: 32"
##   },
##   {
##     "Category": "Clothing",
##     "Item Name": "Jeans",
##     "Item ID": 302,
##     "Brand": "DenimWorks",
##     "Price": 49.99,
##     "Variation ID": "302-B",
##     "Variation Details": "Color: Light Blue, Size: 34"
##   },
##   {
##     "Category": "Books",
##     "Item Name": "Fiction Novel",
##     "Item ID": 401,
##     "Brand": "-",
##     "Price": 14.99,
##     "Variation ID": "401-A",
##     "Variation Details": "Format: Hardcover, Language: English"
##   },
##   {
##     "Category": "Books",
##     "Item Name": "Fiction Novel",
##     "Item ID": 401,
##     "Brand": "-",
##     "Price": 14.99,
##     "Variation ID": "401-B",
##     "Variation Details": "Format: Paperback, Language: Spanish"
##   },
##   {
##     "Category": "Books",
##     "Item Name": "Non-Fiction Guide",
##     "Item ID": 402,
##     "Brand": "-",
##     "Price": 24.99,
##     "Variation ID": "402-A",
##     "Variation Details": "Format: eBook, Language: English"
##   },
##   {
##     "Category": "Books",
##     "Item Name": "Non-Fiction Guide",
##     "Item ID": 402,
##     "Brand": "-",
##     "Price": 24.99,
##     "Variation ID": "402-B",
##     "Variation Details": "Format: Paperback, Language: French"
##   },
##   {
##     "Category": "Sports Equipment",
##     "Item Name": "Basketball",
##     "Item ID": 501,
##     "Brand": "SportsGear",
##     "Price": 29.99,
##     "Variation ID": "501-A",
##     "Variation Details": "Size: Size 7, Color: Orange"
##   },
##   {
##     "Category": "Sports Equipment",
##     "Item Name": "Tennis Racket",
##     "Item ID": 502,
##     "Brand": "RacketPro",
##     "Price": 89.99,
##     "Variation ID": "502-A",
##     "Variation Details": "Material: Graphite, Color: Black"
##   },
##   {
##     "Category": "Sports Equipment",
##     "Item Name": "Tennis Racket",
##     "Item ID": 502,
##     "Brand": "RacketPro",
##     "Price": 89.99,
##     "Variation ID": "502-B",
##     "Variation Details": "Material: Aluminum, Color: Silve"
##   }
## ]

HTML

Next, we format the data into HTML using the kable() function from the kableExtra package. Being that this is HTML, we can also manipulate the aesthetics of the table using the kable_styling() function to produce the table seen below. Afterwards, we write the table into a file called “data.html”. Below are some apparent pros and cons of HTML.

Pros: 1. Lightweight, quick, and flexible 2. Each browser supports HTML 3. Easy to learn

Cons: 1. Limited for displaying content 2. Can be lengthy in code 3. Has a complex structure

html_data <- kable(data, format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
write(html_data, file = "data.html")
html_data
Category Item Name Item ID Brand Price Variation ID Variation Details
Electronics Smartphone 101 TechBrand 699.99 101-A Color: Black, Storage: 64GB
Electronics Smartphone 101 TechBrand 699.99 101-B Color: White, Storage: 128GB
Electronics Laptop 102 CompuBrand 1099.99 102-A Color: Silver, Storage: 256GB
Electronics Laptop 102 CompuBrand 1099.99 102-B Color: Space Gray, Storage: 512GB
Home Appliances Refrigerator 201 HomeCool 899.99 201-A Color: Stainless Steel, Capacity: 20 cu ft
Home Appliances Refrigerator 201 HomeCool 899.99 201-B Color: White, Capacity: 18 cu ft
Home Appliances Washing Machine 202 CleanTech 499.99 202-A Type: Front Load, Capacity: 4.5 cu ft
Home Appliances Washing Machine 202 CleanTech 499.99 202-B Type: Top Load, Capacity: 5.0 cu ft
Clothing T-Shirt 301 FashionCo 19.99 301-A Color: Blue, Size: S
Clothing T-Shirt 301 FashionCo 19.99 301-B Color: Red, Size: M
Clothing T-Shirt 301 FashionCo 19.99 301-C Color: Green, Size: L
Clothing Jeans 302 DenimWorks 49.99 302-A Color: Dark Blue, Size: 32
Clothing Jeans 302 DenimWorks 49.99 302-B Color: Light Blue, Size: 34
Books Fiction Novel 401
14.99 401-A Format: Hardcover, Language: English
Books Fiction Novel 401
14.99 401-B Format: Paperback, Language: Spanish
Books Non-Fiction Guide 402
24.99 402-A Format: eBook, Language: English
Books Non-Fiction Guide 402
24.99 402-B Format: Paperback, Language: French
Sports Equipment Basketball 501 SportsGear 29.99 501-A Size: Size 7, Color: Orange
Sports Equipment Tennis Racket 502 RacketPro 89.99 502-A Material: Graphite, Color: Black
Sports Equipment Tennis Racket 502 RacketPro 89.99 502-B Material: Aluminum, Color: Silve

XML

Now, we format the data into XML through conjunction of the xml_new_root() and the xml_add_child() functions. Where xml_new_root() establishes an XML table and we use a for loop to fill that table with our CUNYMart data. This is all being provided through the xml2 package. Once that’s done, we go ahead and write the XML data into an XML file. Although the end process is similar to HTML, actually creating the table is much more meticulous. Below are some apparent pros and cons of XML.

Pros: 1. Provides a consistent format for data exchange 2. supports hierarchical, data structures 3. Great for data validation

Cons: 1. Files are big and wordy, compared to JSON and CSV 2. Large file size 3. Can have slower transmission and processing times

xml_data <- xml_new_root("root")
for (i in 1:nrow(data)) {
  row_node <- xml_add_child(xml_data,"row")
  for (col in names(data)) {
    xml_add_child(row_node,col,data[i,col])
  }
}
write_xml(xml_data, "data.xml")
xml_data
## {xml_document}
## <root>
##  [1] <row>\n  <Category>list(Category = "Electronics")</Category>\n  <Item Na ...
##  [2] <row>\n  <Category>list(Category = "Electronics")</Category>\n  <Item Na ...
##  [3] <row>\n  <Category>list(Category = "Electronics")</Category>\n  <Item Na ...
##  [4] <row>\n  <Category>list(Category = "Electronics")</Category>\n  <Item Na ...
##  [5] <row>\n  <Category>list(Category = "Home Appliances")</Category>\n  <Ite ...
##  [6] <row>\n  <Category>list(Category = "Home Appliances")</Category>\n  <Ite ...
##  [7] <row>\n  <Category>list(Category = "Home Appliances")</Category>\n  <Ite ...
##  [8] <row>\n  <Category>list(Category = "Home Appliances")</Category>\n  <Ite ...
##  [9] <row>\n  <Category>list(Category = "Clothing")</Category>\n  <Item Name> ...
## [10] <row>\n  <Category>list(Category = "Clothing")</Category>\n  <Item Name> ...
## [11] <row>\n  <Category>list(Category = "Clothing")</Category>\n  <Item Name> ...
## [12] <row>\n  <Category>list(Category = "Clothing")</Category>\n  <Item Name> ...
## [13] <row>\n  <Category>list(Category = "Clothing")</Category>\n  <Item Name> ...
## [14] <row>\n  <Category>list(Category = "Books")</Category>\n  <Item Name>lis ...
## [15] <row>\n  <Category>list(Category = "Books")</Category>\n  <Item Name>lis ...
## [16] <row>\n  <Category>list(Category = "Books")</Category>\n  <Item Name>lis ...
## [17] <row>\n  <Category>list(Category = "Books")</Category>\n  <Item Name>lis ...
## [18] <row>\n  <Category>list(Category = "Sports Equipment")</Category>\n  <It ...
## [19] <row>\n  <Category>list(Category = "Sports Equipment")</Category>\n  <It ...
## [20] <row>\n  <Category>list(Category = "Sports Equipment")</Category>\n  <It ...

Parquet

Lastly, we format the data into Parquet using the write_parquet() function from the arrow package. Similar to JSON, this is a two line transformation. However, the product is a completely understandable table. It seems the only major difference between this file and a CSV is the space saved but must later be parsed. Below are some apparent pros and cons of Parquet.

Pros: 1. Efficient compression 2. Fast query performance 3. Compatible with many big data tools

Cons 1. Higher CPU usage 2. Slower write performance

parquet_data <- write_parquet(data, "data.parquet")
read_parquet("data.parquet")
## # A tibble: 20 × 7
##    Category         `Item Name`       `Item ID` Brand       Price `Variation ID`
##  * <chr>            <chr>                 <dbl> <chr>       <dbl> <chr>         
##  1 Electronics      Smartphone              101 TechBrand   700.  101-A         
##  2 Electronics      Smartphone              101 TechBrand   700.  101-B         
##  3 Electronics      Laptop                  102 CompuBrand 1100.  102-A         
##  4 Electronics      Laptop                  102 CompuBrand 1100.  102-B         
##  5 Home Appliances  Refrigerator            201 HomeCool    900.  201-A         
##  6 Home Appliances  Refrigerator            201 HomeCool    900.  201-B         
##  7 Home Appliances  Washing Machine         202 CleanTech   500.  202-A         
##  8 Home Appliances  Washing Machine         202 CleanTech   500.  202-B         
##  9 Clothing         T-Shirt                 301 FashionCo    20.0 301-A         
## 10 Clothing         T-Shirt                 301 FashionCo    20.0 301-B         
## 11 Clothing         T-Shirt                 301 FashionCo    20.0 301-C         
## 12 Clothing         Jeans                   302 DenimWorks   50.0 302-A         
## 13 Clothing         Jeans                   302 DenimWorks   50.0 302-B         
## 14 Books            Fiction Novel           401 -            15.0 401-A         
## 15 Books            Fiction Novel           401 -            15.0 401-B         
## 16 Books            Non-Fiction Guide       402 -            25.0 402-A         
## 17 Books            Non-Fiction Guide       402 -            25.0 402-B         
## 18 Sports Equipment Basketball              501 SportsGear   30.0 501-A         
## 19 Sports Equipment Tennis Racket           502 RacketPro    90.0 502-A         
## 20 Sports Equipment Tennis Racket           502 RacketPro    90.0 502-B         
## # ℹ 1 more variable: `Variation Details` <chr>

Conclusions / Findings and Recommendations

Each file format has its own set of pros and cons. CSV and Parquet seem to be the most familiar with regards to data structure and storage but Parquet can take a bit of computing power to unpack. JSON is probably the next in simplicity with regards to importing the data. Finally, HTML and XML prove to be the least conventional and can even take up a bit of storage space. All in all, getting familiar with each format is definitely beneficial in the field of data.

Works Cited

Agusmahari. (2023, April 10). Parquet File Format with Other File Formats: Pros and Cons. Medium. https://medium.com/@agusmahari/parquet-file-format-with-other-file-formats-pros-and-cons-471b00bd6c0a

Colton, E. (2024). Pros and Cons of HTML & CSS for Data Analysis? Appdevelopmentcompanies.co. https://appdevelopmentcompanies.co/web-design/pros-and-cons-of-html-css-for-data-analysis

Maciek. (2024, September 25). CSV vs JSON vs XML - The Best Comparison Guide 2024. Sonra. https://sonra.io/csv-vs-json-vs-xml/

Sahana. (2022, January 5). 9 Advantages and Disadvantages of JSON to Pay Heed to. Tech Quintal. https://www.techquintal.com/advantages-and-disadvantages-of-json/