Data 607 Assignment 7

Overview / Introduction

For this assignment, we have received data from a fictional store named “CUNYMart” and have been tasked with preparing the data for analysis by formatting it in JSON, HTML, XML, and Parquet. Additionally, we must provide the pros and cons of each format.

Load Packages

As always, let’s start off by loading the necessary packages.

library(tidyverse)
library(dplyr)
library(jsonlite)
library(kableExtra)
library(xml2)
library(arrow)

Import Data

Now let’s load the data from a .CSV file that’s been created and stored on GitHub.

url <- url('https://raw.githubusercontent.com/Stevee-G/Data607/refs/heads/main/CUNYMart_data.csv')
data <- read_csv(url)
glimpse(data)

## Rows: 20
## Columns: 7
## $ Category            <chr> "Electronics", "Electronics", "Electronics", "Elec…
## $ `Item Name`         <chr> "Smartphone", "Smartphone", "Laptop", "Laptop", "R…
## $ `Item ID`           <dbl> 101, 101, 102, 102, 201, 201, 202, 202, 301, 301, …
## $ Brand               <chr> "TechBrand", "TechBrand", "CompuBrand", "CompuBran…
## $ Price               <dbl> 699.99, 699.99, 1099.99, 1099.99, 899.99, 899.99, …
## $ `Variation ID`      <chr> "101-A", "101-B", "102-A", "102-B", "201-A", "201-…
## $ `Variation Details` <chr> "Color: Black, Storage: 64GB", "Color: White, Stor…

JSON

First, we will format the data into JSON using the toJSON() function from the jsonlite package. Right off the bat we can see that JSON is stored differently from a .CSV data frame, which makes sense since the JSON file type is tailored towards online or the webpage experience. Below are some apparent pros and cons of JSON.

Pros: 1. Easy to learn and simple to read and understand. 2. Text-only format, making it easy to send across servers. 3. Faster and easier to parse than XML.

Cons: 1. Not suitable for complex data structures. 2. Not as useful for aggregate queries compared to other formats like XML

json_data <- toJSON(data, pretty = TRUE)
json_data

## [
##   {
##     "Category": "Electronics",
##     "Item Name": "Smartphone",
##     "Item ID": 101,
##     "Brand": "TechBrand",
##     "Price": 699.99,
##     "Variation ID": "101-A",
##     "Variation Details": "Color: Black, Storage: 64GB"
##   },
##   {
##     "Category": "Electronics",
##     "Item Name": "Smartphone",
##     "Item ID": 101,
##     "Brand": "TechBrand",
##     "Price": 699.99,
##     "Variation ID": "101-B",
##     "Variation Details": "Color: White, Storage: 128GB"
##   },
##   {
##     "Category": "Electronics",
##     "Item Name": "Laptop",
##     "Item ID": 102,
##     "Brand": "CompuBrand",
##     "Price": 1099.99,
##     "Variation ID": "102-A",
##     "Variation Details": "Color: Silver, Storage: 256GB"
##   },
##   {
##     "Category": "Electronics",
##     "Item Name": "Laptop",
##     "Item ID": 102,
##     "Brand": "CompuBrand",
##     "Price": 1099.99,
##     "Variation ID": "102-B",
##     "Variation Details": "Color: Space Gray, Storage: 512GB"
##   },
##   {
##     "Category": "Home Appliances",
##     "Item Name": "Refrigerator",
##     "Item ID": 201,
##     "Brand": "HomeCool",
##     "Price": 899.99,
##     "Variation ID": "201-A",
##     "Variation Details": "Color: Stainless Steel, Capacity: 20 cu ft"
##   },
##   {
##     "Category": "Home Appliances",
##     "Item Name": "Refrigerator",
##     "Item ID": 201,
##     "Brand": "HomeCool",
##     "Price": 899.99,
##     "Variation ID": "201-B",
##     "Variation Details": "Color: White, Capacity: 18 cu ft"
##   },
##   {
##     "Category": "Home Appliances",
##     "Item Name": "Washing Machine",
##     "Item ID": 202,
##     "Brand": "CleanTech",
##     "Price": 499.99,
##     "Variation ID": "202-A",
##     "Variation Details": "Type: Front Load, Capacity: 4.5 cu ft"
##   },
##   {
##     "Category": "Home Appliances",
##     "Item Name": "Washing Machine",
##     "Item ID": 202,
##     "Brand": "CleanTech",
##     "Price": 499.99,
##     "Variation ID": "202-B",
##     "Variation Details": "Type: Top Load, Capacity: 5.0 cu ft"
##   },
##   {
##     "Category": "Clothing",
##     "Item Name": "T-Shirt",
##     "Item ID": 301,
##     "Brand": "FashionCo",
##     "Price": 19.99,
##     "Variation ID": "301-A",
##     "Variation Details": "Color: Blue, Size: S"
##   },
##   {
##     "Category": "Clothing",
##     "Item Name": "T-Shirt",
##     "Item ID": 301,
##     "Brand": "FashionCo",
##     "Price": 19.99,
##     "Variation ID": "301-B",
##     "Variation Details": "Color: Red, Size: M"
##   },
##   {
##     "Category": "Clothing",
##     "Item Name": "T-Shirt",
##     "Item ID": 301,
##     "Brand": "FashionCo",
##     "Price": 19.99,
##     "Variation ID": "301-C",
##     "Variation Details": "Color: Green, Size: L"
##   },
##   {
##     "Category": "Clothing",
##     "Item Name": "Jeans",
##     "Item ID": 302,
##     "Brand": "DenimWorks",
##     "Price": 49.99,
##     "Variation ID": "302-A",
##     "Variation Details": "Color: Dark Blue, Size: 32"
##   },
##   {
##     "Category": "Clothing",
##     "Item Name": "Jeans",
##     "Item ID": 302,
##     "Brand": "DenimWorks",
##     "Price": 49.99,
##     "Variation ID": "302-B",
##     "Variation Details": "Color: Light Blue, Size: 34"
##   },
##   {
##     "Category": "Books",
##     "Item Name": "Fiction Novel",
##     "Item ID": 401,
##     "Brand": "-",
##     "Price": 14.99,
##     "Variation ID": "401-A",
##     "Variation Details": "Format: Hardcover, Language: English"
##   },
##   {
##     "Category": "Books",
##     "Item Name": "Fiction Novel",
##     "Item ID": 401,
##     "Brand": "-",
##     "Price": 14.99,
##     "Variation ID": "401-B",
##     "Variation Details": "Format: Paperback, Language: Spanish"
##   },
##   {
##     "Category": "Books",
##     "Item Name": "Non-Fiction Guide",
##     "Item ID": 402,
##     "Brand": "-",
##     "Price": 24.99,
##     "Variation ID": "402-A",
##     "Variation Details": "Format: eBook, Language: English"
##   },
##   {
##     "Category": "Books",
##     "Item Name": "Non-Fiction Guide",
##     "Item ID": 402,
##     "Brand": "-",
##     "Price": 24.99,
##     "Variation ID": "402-B",
##     "Variation Details": "Format: Paperback, Language: French"
##   },
##   {
##     "Category": "Sports Equipment",
##     "Item Name": "Basketball",
##     "Item ID": 501,
##     "Brand": "SportsGear",
##     "Price": 29.99,
##     "Variation ID": "501-A",
##     "Variation Details": "Size: Size 7, Color: Orange"
##   },
##   {
##     "Category": "Sports Equipment",
##     "Item Name": "Tennis Racket",
##     "Item ID": 502,
##     "Brand": "RacketPro",
##     "Price": 89.99,
##     "Variation ID": "502-A",
##     "Variation Details": "Material: Graphite, Color: Black"
##   },
##   {
##     "Category": "Sports Equipment",
##     "Item Name": "Tennis Racket",
##     "Item ID": 502,
##     "Brand": "RacketPro",
##     "Price": 89.99,
##     "Variation ID": "502-B",
##     "Variation Details": "Material: Aluminum, Color: Silve"
##   }
## ]

HTML

Next, we format the data into HTML using the kable() function from the kableExtra package. Being that this is HTML, we can also manipulate the aesthetics of the table using the kable_styling() function to produce the table seen below. Afterwards, we write the table into a file called “data.html”. Below are some apparent pros and cons of HTML.

Pros: 1. Lightweight, quick, and flexible 2. Each browser supports HTML 3. Easy to learn

Cons: 1. Limited for displaying content 2. Can be lengthy in code 3. Has a complex structure

html_data <- kable(data, format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
write(html_data, file = "data.html")
html_data

Category	Item Name	Item ID	Brand	Price	Variation ID	Variation Details
Electronics	Smartphone	101	TechBrand	699.99	101-A	Color: Black, Storage: 64GB
Electronics	Smartphone	101	TechBrand	699.99	101-B	Color: White, Storage: 128GB
Electronics	Laptop	102	CompuBrand	1099.99	102-A	Color: Silver, Storage: 256GB
Electronics	Laptop	102	CompuBrand	1099.99	102-B	Color: Space Gray, Storage: 512GB
Home Appliances	Refrigerator	201	HomeCool	899.99	201-A	Color: Stainless Steel, Capacity: 20 cu ft
Home Appliances	Refrigerator	201	HomeCool	899.99	201-B	Color: White, Capacity: 18 cu ft
Home Appliances	Washing Machine	202	CleanTech	499.99	202-A	Type: Front Load, Capacity: 4.5 cu ft
Home Appliances	Washing Machine	202	CleanTech	499.99	202-B	Type: Top Load, Capacity: 5.0 cu ft
Clothing	T-Shirt	301	FashionCo	19.99	301-A	Color: Blue, Size: S
Clothing	T-Shirt	301	FashionCo	19.99	301-B	Color: Red, Size: M
Clothing	T-Shirt	301	FashionCo	19.99	301-C	Color: Green, Size: L
Clothing	Jeans	302	DenimWorks	49.99	302-A	Color: Dark Blue, Size: 32
Clothing	Jeans	302	DenimWorks	49.99	302-B	Color: Light Blue, Size: 34
Books	Fiction Novel	401		14.99	401-A	Format: Hardcover, Language: English
Books	Fiction Novel	401		14.99	401-B	Format: Paperback, Language: Spanish
Books	Non-Fiction Guide	402		24.99	402-A	Format: eBook, Language: English
Books	Non-Fiction Guide	402		24.99	402-B	Format: Paperback, Language: French
Sports Equipment	Basketball	501	SportsGear	29.99	501-A	Size: Size 7, Color: Orange
Sports Equipment	Tennis Racket	502	RacketPro	89.99	502-A	Material: Graphite, Color: Black
Sports Equipment	Tennis Racket	502	RacketPro	89.99	502-B	Material: Aluminum, Color: Silve

XML

Now, we format the data into XML through conjunction of the xml_new_root() and the xml_add_child() functions. Where xml_new_root() establishes an XML table and we use a for loop to fill that table with our CUNYMart data. This is all being provided through the xml2 package. Once that’s done, we go ahead and write the XML data into an XML file. Although the end process is similar to HTML, actually creating the table is much more meticulous. Below are some apparent pros and cons of XML.

Pros: 1. Provides a consistent format for data exchange 2. supports hierarchical, data structures 3. Great for data validation

Cons: 1. Files are big and wordy, compared to JSON and CSV 2. Large file size 3. Can have slower transmission and processing times

xml_data <- xml_new_root("root")
for (i in 1:nrow(data)) {
  row_node <- xml_add_child(xml_data,"row")
  for (col in names(data)) {
    xml_add_child(row_node,col,data[i,col])
  }
}
write_xml(xml_data, "data.xml")
xml_data

## {xml_document}
## <root>
##  [1] <row>\n  <Category>list(Category = "Electronics")</Category>\n  <Item Na ...
##  [2] <row>\n  <Category>list(Category = "Electronics")</Category>\n  <Item Na ...
##  [3] <row>\n  <Category>list(Category = "Electronics")</Category>\n  <Item Na ...
##  [4] <row>\n  <Category>list(Category = "Electronics")</Category>\n  <Item Na ...
##  [5] <row>\n  <Category>list(Category = "Home Appliances")</Category>\n  <Ite ...
##  [6] <row>\n  <Category>list(Category = "Home Appliances")</Category>\n  <Ite ...
##  [7] <row>\n  <Category>list(Category = "Home Appliances")</Category>\n  <Ite ...
##  [8] <row>\n  <Category>list(Category = "Home Appliances")</Category>\n  <Ite ...
##  [9] <row>\n  <Category>list(Category = "Clothing")</Category>\n  <Item Name> ...
## [10] <row>\n  <Category>list(Category = "Clothing")</Category>\n  <Item Name> ...
## [11] <row>\n  <Category>list(Category = "Clothing")</Category>\n  <Item Name> ...
## [12] <row>\n  <Category>list(Category = "Clothing")</Category>\n  <Item Name> ...
## [13] <row>\n  <Category>list(Category = "Clothing")</Category>\n  <Item Name> ...
## [14] <row>\n  <Category>list(Category = "Books")</Category>\n  <Item Name>lis ...
## [15] <row>\n  <Category>list(Category = "Books")</Category>\n  <Item Name>lis ...
## [16] <row>\n  <Category>list(Category = "Books")</Category>\n  <Item Name>lis ...
## [17] <row>\n  <Category>list(Category = "Books")</Category>\n  <Item Name>lis ...
## [18] <row>\n  <Category>list(Category = "Sports Equipment")</Category>\n  <It ...
## [19] <row>\n  <Category>list(Category = "Sports Equipment")</Category>\n  <It ...
## [20] <row>\n  <Category>list(Category = "Sports Equipment")</Category>\n  <It ...

Parquet

Lastly, we format the data into Parquet using the write_parquet() function from the arrow package. Similar to JSON, this is a two line transformation. However, the product is a completely understandable table. It seems the only major difference between this file and a CSV is the space saved but must later be parsed. Below are some apparent pros and cons of Parquet.

Pros: 1. Efficient compression 2. Fast query performance 3. Compatible with many big data tools

Cons 1. Higher CPU usage 2. Slower write performance

parquet_data <- write_parquet(data, "data.parquet")
read_parquet("data.parquet")

## # A tibble: 20 × 7
##    Category         `Item Name`       `Item ID` Brand       Price `Variation ID`
##  * <chr>            <chr>                 <dbl> <chr>       <dbl> <chr>         
##  1 Electronics      Smartphone              101 TechBrand   700.  101-A         
##  2 Electronics      Smartphone              101 TechBrand   700.  101-B         
##  3 Electronics      Laptop                  102 CompuBrand 1100.  102-A         
##  4 Electronics      Laptop                  102 CompuBrand 1100.  102-B         
##  5 Home Appliances  Refrigerator            201 HomeCool    900.  201-A         
##  6 Home Appliances  Refrigerator            201 HomeCool    900.  201-B         
##  7 Home Appliances  Washing Machine         202 CleanTech   500.  202-A         
##  8 Home Appliances  Washing Machine         202 CleanTech   500.  202-B         
##  9 Clothing         T-Shirt                 301 FashionCo    20.0 301-A         
## 10 Clothing         T-Shirt                 301 FashionCo    20.0 301-B         
## 11 Clothing         T-Shirt                 301 FashionCo    20.0 301-C         
## 12 Clothing         Jeans                   302 DenimWorks   50.0 302-A         
## 13 Clothing         Jeans                   302 DenimWorks   50.0 302-B         
## 14 Books            Fiction Novel           401 -            15.0 401-A         
## 15 Books            Fiction Novel           401 -            15.0 401-B         
## 16 Books            Non-Fiction Guide       402 -            25.0 402-A         
## 17 Books            Non-Fiction Guide       402 -            25.0 402-B         
## 18 Sports Equipment Basketball              501 SportsGear   30.0 501-A         
## 19 Sports Equipment Tennis Racket           502 RacketPro    90.0 502-A         
## 20 Sports Equipment Tennis Racket           502 RacketPro    90.0 502-B         
## # ℹ 1 more variable: `Variation Details` <chr>

Data 607 Assignment 7

Steven Gonzalez

10/20/2024