For this assignment, we have received data from a fictional store named “CUNYMart” and have been tasked with preparing the data for analysis by formatting it in JSON, HTML, XML, and Parquet. Additionally, we must provide the pros and cons of each format.
As always, let’s start off by loading the necessary packages.
library(tidyverse)
library(dplyr)
library(jsonlite)
library(kableExtra)
library(xml2)
library(arrow)
Now let’s load the data from a .CSV file that’s been created and stored on GitHub.
url <- url('https://raw.githubusercontent.com/Stevee-G/Data607/refs/heads/main/CUNYMart_data.csv')
data <- read_csv(url)
glimpse(data)
## Rows: 20
## Columns: 7
## $ Category <chr> "Electronics", "Electronics", "Electronics", "Elec…
## $ `Item Name` <chr> "Smartphone", "Smartphone", "Laptop", "Laptop", "R…
## $ `Item ID` <dbl> 101, 101, 102, 102, 201, 201, 202, 202, 301, 301, …
## $ Brand <chr> "TechBrand", "TechBrand", "CompuBrand", "CompuBran…
## $ Price <dbl> 699.99, 699.99, 1099.99, 1099.99, 899.99, 899.99, …
## $ `Variation ID` <chr> "101-A", "101-B", "102-A", "102-B", "201-A", "201-…
## $ `Variation Details` <chr> "Color: Black, Storage: 64GB", "Color: White, Stor…
First, we will format the data into JSON using the
toJSON() function from the jsonlite package.
Right off the bat we can see that JSON is stored differently from a .CSV
data frame, which makes sense since the JSON file type is tailored
towards online or the webpage experience. Below are some apparent pros
and cons of JSON.
Pros: 1. Easy to learn and simple to read and understand. 2. Text-only format, making it easy to send across servers. 3. Faster and easier to parse than XML.
Cons: 1. Not suitable for complex data structures. 2. Not as useful for aggregate queries compared to other formats like XML
json_data <- toJSON(data, pretty = TRUE)
json_data
## [
## {
## "Category": "Electronics",
## "Item Name": "Smartphone",
## "Item ID": 101,
## "Brand": "TechBrand",
## "Price": 699.99,
## "Variation ID": "101-A",
## "Variation Details": "Color: Black, Storage: 64GB"
## },
## {
## "Category": "Electronics",
## "Item Name": "Smartphone",
## "Item ID": 101,
## "Brand": "TechBrand",
## "Price": 699.99,
## "Variation ID": "101-B",
## "Variation Details": "Color: White, Storage: 128GB"
## },
## {
## "Category": "Electronics",
## "Item Name": "Laptop",
## "Item ID": 102,
## "Brand": "CompuBrand",
## "Price": 1099.99,
## "Variation ID": "102-A",
## "Variation Details": "Color: Silver, Storage: 256GB"
## },
## {
## "Category": "Electronics",
## "Item Name": "Laptop",
## "Item ID": 102,
## "Brand": "CompuBrand",
## "Price": 1099.99,
## "Variation ID": "102-B",
## "Variation Details": "Color: Space Gray, Storage: 512GB"
## },
## {
## "Category": "Home Appliances",
## "Item Name": "Refrigerator",
## "Item ID": 201,
## "Brand": "HomeCool",
## "Price": 899.99,
## "Variation ID": "201-A",
## "Variation Details": "Color: Stainless Steel, Capacity: 20 cu ft"
## },
## {
## "Category": "Home Appliances",
## "Item Name": "Refrigerator",
## "Item ID": 201,
## "Brand": "HomeCool",
## "Price": 899.99,
## "Variation ID": "201-B",
## "Variation Details": "Color: White, Capacity: 18 cu ft"
## },
## {
## "Category": "Home Appliances",
## "Item Name": "Washing Machine",
## "Item ID": 202,
## "Brand": "CleanTech",
## "Price": 499.99,
## "Variation ID": "202-A",
## "Variation Details": "Type: Front Load, Capacity: 4.5 cu ft"
## },
## {
## "Category": "Home Appliances",
## "Item Name": "Washing Machine",
## "Item ID": 202,
## "Brand": "CleanTech",
## "Price": 499.99,
## "Variation ID": "202-B",
## "Variation Details": "Type: Top Load, Capacity: 5.0 cu ft"
## },
## {
## "Category": "Clothing",
## "Item Name": "T-Shirt",
## "Item ID": 301,
## "Brand": "FashionCo",
## "Price": 19.99,
## "Variation ID": "301-A",
## "Variation Details": "Color: Blue, Size: S"
## },
## {
## "Category": "Clothing",
## "Item Name": "T-Shirt",
## "Item ID": 301,
## "Brand": "FashionCo",
## "Price": 19.99,
## "Variation ID": "301-B",
## "Variation Details": "Color: Red, Size: M"
## },
## {
## "Category": "Clothing",
## "Item Name": "T-Shirt",
## "Item ID": 301,
## "Brand": "FashionCo",
## "Price": 19.99,
## "Variation ID": "301-C",
## "Variation Details": "Color: Green, Size: L"
## },
## {
## "Category": "Clothing",
## "Item Name": "Jeans",
## "Item ID": 302,
## "Brand": "DenimWorks",
## "Price": 49.99,
## "Variation ID": "302-A",
## "Variation Details": "Color: Dark Blue, Size: 32"
## },
## {
## "Category": "Clothing",
## "Item Name": "Jeans",
## "Item ID": 302,
## "Brand": "DenimWorks",
## "Price": 49.99,
## "Variation ID": "302-B",
## "Variation Details": "Color: Light Blue, Size: 34"
## },
## {
## "Category": "Books",
## "Item Name": "Fiction Novel",
## "Item ID": 401,
## "Brand": "-",
## "Price": 14.99,
## "Variation ID": "401-A",
## "Variation Details": "Format: Hardcover, Language: English"
## },
## {
## "Category": "Books",
## "Item Name": "Fiction Novel",
## "Item ID": 401,
## "Brand": "-",
## "Price": 14.99,
## "Variation ID": "401-B",
## "Variation Details": "Format: Paperback, Language: Spanish"
## },
## {
## "Category": "Books",
## "Item Name": "Non-Fiction Guide",
## "Item ID": 402,
## "Brand": "-",
## "Price": 24.99,
## "Variation ID": "402-A",
## "Variation Details": "Format: eBook, Language: English"
## },
## {
## "Category": "Books",
## "Item Name": "Non-Fiction Guide",
## "Item ID": 402,
## "Brand": "-",
## "Price": 24.99,
## "Variation ID": "402-B",
## "Variation Details": "Format: Paperback, Language: French"
## },
## {
## "Category": "Sports Equipment",
## "Item Name": "Basketball",
## "Item ID": 501,
## "Brand": "SportsGear",
## "Price": 29.99,
## "Variation ID": "501-A",
## "Variation Details": "Size: Size 7, Color: Orange"
## },
## {
## "Category": "Sports Equipment",
## "Item Name": "Tennis Racket",
## "Item ID": 502,
## "Brand": "RacketPro",
## "Price": 89.99,
## "Variation ID": "502-A",
## "Variation Details": "Material: Graphite, Color: Black"
## },
## {
## "Category": "Sports Equipment",
## "Item Name": "Tennis Racket",
## "Item ID": 502,
## "Brand": "RacketPro",
## "Price": 89.99,
## "Variation ID": "502-B",
## "Variation Details": "Material: Aluminum, Color: Silve"
## }
## ]
Next, we format the data into HTML using the kable()
function from the kableExtra package. Being that this is
HTML, we can also manipulate the aesthetics of the table using the
kable_styling() function to produce the table seen below.
Afterwards, we write the table into a file called “data.html”. Below are
some apparent pros
and cons of HTML.
Pros: 1. Lightweight, quick, and flexible 2. Each browser supports HTML 3. Easy to learn
Cons: 1. Limited for displaying content 2. Can be lengthy in code 3. Has a complex structure
html_data <- kable(data, format = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
write(html_data, file = "data.html")
html_data
| Category | Item Name | Item ID | Brand | Price | Variation ID | Variation Details |
|---|---|---|---|---|---|---|
| Electronics | Smartphone | 101 | TechBrand | 699.99 | 101-A | Color: Black, Storage: 64GB |
| Electronics | Smartphone | 101 | TechBrand | 699.99 | 101-B | Color: White, Storage: 128GB |
| Electronics | Laptop | 102 | CompuBrand | 1099.99 | 102-A | Color: Silver, Storage: 256GB |
| Electronics | Laptop | 102 | CompuBrand | 1099.99 | 102-B | Color: Space Gray, Storage: 512GB |
| Home Appliances | Refrigerator | 201 | HomeCool | 899.99 | 201-A | Color: Stainless Steel, Capacity: 20 cu ft |
| Home Appliances | Refrigerator | 201 | HomeCool | 899.99 | 201-B | Color: White, Capacity: 18 cu ft |
| Home Appliances | Washing Machine | 202 | CleanTech | 499.99 | 202-A | Type: Front Load, Capacity: 4.5 cu ft |
| Home Appliances | Washing Machine | 202 | CleanTech | 499.99 | 202-B | Type: Top Load, Capacity: 5.0 cu ft |
| Clothing | T-Shirt | 301 | FashionCo | 19.99 | 301-A | Color: Blue, Size: S |
| Clothing | T-Shirt | 301 | FashionCo | 19.99 | 301-B | Color: Red, Size: M |
| Clothing | T-Shirt | 301 | FashionCo | 19.99 | 301-C | Color: Green, Size: L |
| Clothing | Jeans | 302 | DenimWorks | 49.99 | 302-A | Color: Dark Blue, Size: 32 |
| Clothing | Jeans | 302 | DenimWorks | 49.99 | 302-B | Color: Light Blue, Size: 34 |
| Books | Fiction Novel | 401 |
|
14.99 | 401-A | Format: Hardcover, Language: English |
| Books | Fiction Novel | 401 |
|
14.99 | 401-B | Format: Paperback, Language: Spanish |
| Books | Non-Fiction Guide | 402 |
|
24.99 | 402-A | Format: eBook, Language: English |
| Books | Non-Fiction Guide | 402 |
|
24.99 | 402-B | Format: Paperback, Language: French |
| Sports Equipment | Basketball | 501 | SportsGear | 29.99 | 501-A | Size: Size 7, Color: Orange |
| Sports Equipment | Tennis Racket | 502 | RacketPro | 89.99 | 502-A | Material: Graphite, Color: Black |
| Sports Equipment | Tennis Racket | 502 | RacketPro | 89.99 | 502-B | Material: Aluminum, Color: Silve |
Now, we format the data into XML through conjunction of the
xml_new_root() and the xml_add_child()
functions. Where xml_new_root() establishes an XML table
and we use a for loop to fill that table with our CUNYMart data. This is
all being provided through the xml2 package. Once that’s
done, we go ahead and write the XML data into an XML file. Although the
end process is similar to HTML, actually creating the table is much more
meticulous. Below are some apparent pros and cons of
XML.
Pros: 1. Provides a consistent format for data exchange 2. supports hierarchical, data structures 3. Great for data validation
Cons: 1. Files are big and wordy, compared to JSON and CSV 2. Large file size 3. Can have slower transmission and processing times
xml_data <- xml_new_root("root")
for (i in 1:nrow(data)) {
row_node <- xml_add_child(xml_data,"row")
for (col in names(data)) {
xml_add_child(row_node,col,data[i,col])
}
}
write_xml(xml_data, "data.xml")
xml_data
## {xml_document}
## <root>
## [1] <row>\n <Category>list(Category = "Electronics")</Category>\n <Item Na ...
## [2] <row>\n <Category>list(Category = "Electronics")</Category>\n <Item Na ...
## [3] <row>\n <Category>list(Category = "Electronics")</Category>\n <Item Na ...
## [4] <row>\n <Category>list(Category = "Electronics")</Category>\n <Item Na ...
## [5] <row>\n <Category>list(Category = "Home Appliances")</Category>\n <Ite ...
## [6] <row>\n <Category>list(Category = "Home Appliances")</Category>\n <Ite ...
## [7] <row>\n <Category>list(Category = "Home Appliances")</Category>\n <Ite ...
## [8] <row>\n <Category>list(Category = "Home Appliances")</Category>\n <Ite ...
## [9] <row>\n <Category>list(Category = "Clothing")</Category>\n <Item Name> ...
## [10] <row>\n <Category>list(Category = "Clothing")</Category>\n <Item Name> ...
## [11] <row>\n <Category>list(Category = "Clothing")</Category>\n <Item Name> ...
## [12] <row>\n <Category>list(Category = "Clothing")</Category>\n <Item Name> ...
## [13] <row>\n <Category>list(Category = "Clothing")</Category>\n <Item Name> ...
## [14] <row>\n <Category>list(Category = "Books")</Category>\n <Item Name>lis ...
## [15] <row>\n <Category>list(Category = "Books")</Category>\n <Item Name>lis ...
## [16] <row>\n <Category>list(Category = "Books")</Category>\n <Item Name>lis ...
## [17] <row>\n <Category>list(Category = "Books")</Category>\n <Item Name>lis ...
## [18] <row>\n <Category>list(Category = "Sports Equipment")</Category>\n <It ...
## [19] <row>\n <Category>list(Category = "Sports Equipment")</Category>\n <It ...
## [20] <row>\n <Category>list(Category = "Sports Equipment")</Category>\n <It ...
Lastly, we format the data into Parquet using the
write_parquet() function from the arrow
package. Similar to JSON, this is a two line transformation. However,
the product is a completely understandable table. It seems the only
major difference between this file and a CSV is the space saved but must
later be parsed. Below are some apparent pros
and cons of Parquet.
Pros: 1. Efficient compression 2. Fast query performance 3. Compatible with many big data tools
Cons 1. Higher CPU usage 2. Slower write performance
parquet_data <- write_parquet(data, "data.parquet")
read_parquet("data.parquet")
## # A tibble: 20 × 7
## Category `Item Name` `Item ID` Brand Price `Variation ID`
## * <chr> <chr> <dbl> <chr> <dbl> <chr>
## 1 Electronics Smartphone 101 TechBrand 700. 101-A
## 2 Electronics Smartphone 101 TechBrand 700. 101-B
## 3 Electronics Laptop 102 CompuBrand 1100. 102-A
## 4 Electronics Laptop 102 CompuBrand 1100. 102-B
## 5 Home Appliances Refrigerator 201 HomeCool 900. 201-A
## 6 Home Appliances Refrigerator 201 HomeCool 900. 201-B
## 7 Home Appliances Washing Machine 202 CleanTech 500. 202-A
## 8 Home Appliances Washing Machine 202 CleanTech 500. 202-B
## 9 Clothing T-Shirt 301 FashionCo 20.0 301-A
## 10 Clothing T-Shirt 301 FashionCo 20.0 301-B
## 11 Clothing T-Shirt 301 FashionCo 20.0 301-C
## 12 Clothing Jeans 302 DenimWorks 50.0 302-A
## 13 Clothing Jeans 302 DenimWorks 50.0 302-B
## 14 Books Fiction Novel 401 - 15.0 401-A
## 15 Books Fiction Novel 401 - 15.0 401-B
## 16 Books Non-Fiction Guide 402 - 25.0 402-A
## 17 Books Non-Fiction Guide 402 - 25.0 402-B
## 18 Sports Equipment Basketball 501 SportsGear 30.0 501-A
## 19 Sports Equipment Tennis Racket 502 RacketPro 90.0 502-A
## 20 Sports Equipment Tennis Racket 502 RacketPro 90.0 502-B
## # ℹ 1 more variable: `Variation Details` <chr>
Each file format has its own set of pros and cons. CSV and Parquet seem to be the most familiar with regards to data structure and storage but Parquet can take a bit of computing power to unpack. JSON is probably the next in simplicity with regards to importing the data. Finally, HTML and XML prove to be the least conventional and can even take up a bit of storage space. All in all, getting familiar with each format is definitely beneficial in the field of data.
Agusmahari. (2023, April 10). Parquet File Format with Other File Formats: Pros and Cons. Medium. https://medium.com/@agusmahari/parquet-file-format-with-other-file-formats-pros-and-cons-471b00bd6c0a
Colton, E. (2024). Pros and Cons of HTML & CSS for Data Analysis? Appdevelopmentcompanies.co. https://appdevelopmentcompanies.co/web-design/pros-and-cons-of-html-css-for-data-analysis
Maciek. (2024, September 25). CSV vs JSON vs XML - The Best Comparison Guide 2024. Sonra. https://sonra.io/csv-vs-json-vs-xml/
Sahana. (2022, January 5). 9 Advantages and Disadvantages of JSON to Pay Heed to. Tech Quintal. https://www.techquintal.com/advantages-and-disadvantages-of-json/