Introduction

In this analysis, we’re working with the CUNYMart inventory dataset, which has been provided as plain text. Our goal is to import this dataset and convert it into several useful formats: JSON, HTML, XML, and Parquet. We’ll explore how each format can be beneficial depending on the context and discuss their pros and cons.

Step 1: Importing the Dataset from Text

First, we need to bring the raw dataset into R. Since the data arrived as text, we start by storing it in R as a single character string. Here’s the raw dataset:

# Raw text input
raw_text <- "
Category,Item Name,Item ID,Brand,Price,Variation ID,Variation Details
Electronics,Smartphone,101,TechBrand,699.99,101-A,Color: Black, Storage: 64GB
Electronics,Smartphone,101,TechBrand,699.99,101-B,Color: White, Storage: 128GB
Electronics,Laptop,102,CompuBrand,1099.99,102-A,Color: Silver, Storage: 256GB
Electronics,Laptop,102,CompuBrand,1099.99,102-B,Color: Space Gray, Storage: 512GB
Home Appliances,Refrigerator,201,HomeCool,899.99,201-A,Color: Stainless Steel, Capacity: 20 cu ft
Home Appliances,Refrigerator,201,HomeCool,899.99,201-B,Color: White, Capacity: 18 cu ft
Home Appliances,Washing Machine,202,CleanTech,499.99,202-A,Type: Front Load, Capacity: 4.5 cu ft
Home Appliances,Washing Machine,202,CleanTech,499.99,202-B,Type: Top Load, Capacity: 5.0 cu ft
Clothing,T-Shirt,301,FashionCo,19.99,301-A,Color: Blue, Size: S
Clothing,T-Shirt,301,FashionCo,19.99,301-B,Color: Red, Size: M
Clothing,T-Shirt,301,FashionCo,19.99,301-C,Color: Green, Size: L
Clothing,Jeans,302,DenimWorks,49.99,302-A,Color: Dark Blue, Size: 32
Clothing,Jeans,302,DenimWorks,49.99,302-B,Color: Light Blue, Size: 34
Books,Fiction Novel,401,-,14.99,401-A,Format: Hardcover, Language: English
Books,Fiction Novel,401,-,14.99,401-B,Format: Paperback, Language: Spanish
Books,Non-Fiction Guide,402,-,24.99,402-A,Format: eBook, Language: English
Books,Non-Fiction Guide,402,-,24.99,402-B,Format: Paperback, Language: French
Sports Equipment,Basketball,501,SportsGear,29.99,501-A,Size: Size 7, Color: Orange
Sports Equipment,Tennis Racket,502,RacketPro,89.99,502-A,Material: Graphite, Color: Black
Sports Equipment,Tennis Racket,502,RacketPro,89.99,502-B,Material: Aluminum, Color: Silver
"

Step 2: Converting Text into a Data Frame

Now that we have the text in R, the next step is to convert it into a structured data frame. A data frame is simply a table that organizes our data into rows and columns.

# Converting raw text to a data frame
library(readr)

# Using `read_csv` to read the text as a CSV; `I()` marks the string as literal data, not a file path
cunymart_data <- read_csv(I(raw_text))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 20 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Category, Item Name, Brand, Variation ID, Variation Details
## dbl (2): Item ID, Price
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the data frame
print(cunymart_data)
## # A tibble: 20 × 7
##    Category         `Item Name`       `Item ID` Brand       Price `Variation ID`
##    <chr>            <chr>                 <dbl> <chr>       <dbl> <chr>         
##  1 Electronics      Smartphone              101 TechBrand   700.  101-A         
##  2 Electronics      Smartphone              101 TechBrand   700.  101-B         
##  3 Electronics      Laptop                  102 CompuBrand 1100.  102-A         
##  4 Electronics      Laptop                  102 CompuBrand 1100.  102-B         
##  5 Home Appliances  Refrigerator            201 HomeCool    900.  201-A         
##  6 Home Appliances  Refrigerator            201 HomeCool    900.  201-B         
##  7 Home Appliances  Washing Machine         202 CleanTech   500.  202-A         
##  8 Home Appliances  Washing Machine         202 CleanTech   500.  202-B         
##  9 Clothing         T-Shirt                 301 FashionCo    20.0 301-A         
## 10 Clothing         T-Shirt                 301 FashionCo    20.0 301-B         
## 11 Clothing         T-Shirt                 301 FashionCo    20.0 301-C         
## 12 Clothing         Jeans                   302 DenimWorks   50.0 302-A         
## 13 Clothing         Jeans                   302 DenimWorks   50.0 302-B         
## 14 Books            Fiction Novel           401 -            15.0 401-A         
## 15 Books            Fiction Novel           401 -            15.0 401-B         
## 16 Books            Non-Fiction Guide       402 -            25.0 402-A         
## 17 Books            Non-Fiction Guide       402 -            25.0 402-B         
## 18 Sports Equipment Basketball              501 SportsGear   30.0 501-A         
## 19 Sports Equipment Tennis Racket           502 RacketPro    90.0 502-A         
## 20 Sports Equipment Tennis Racket           502 RacketPro    90.0 502-B         
## # ℹ 1 more variable: `Variation Details` <chr>

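At this point, we’ve turned the text into an organized table of 20 rows and 7 columns. The parsing warning arises because the Variation Details values contain unquoted commas, so each data row has eight comma-separated fields against seven column names. As the JSON output in Step 3 shows, the extra field was merged back into Variation Details, so no data was lost here, but quoting that field in the source would be the more robust fix.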
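To see where the parsing warning comes from, we can count the fields in each row. The following is a minimal sketch in base R, assuming a two-row sample laid out like the raw text; `sample_text`, `split_row`, and `sample_df` are illustrative names and not part of the original analysis. It splits each row on the first six commas only, so everything after them stays together in the last column.

```r
# Two-row sample in the same layout as the raw text (illustrative only)
sample_text <- "Category,Item Name,Item ID,Brand,Price,Variation ID,Variation Details
Electronics,Smartphone,101,TechBrand,699.99,101-A,Color: Black, Storage: 64GB
Clothing,Jeans,302,DenimWorks,49.99,302-A,Color: Dark Blue, Size: 32"

lines <- strsplit(sample_text, "\n", fixed = TRUE)[[1]]
header <- strsplit(lines[1], ",", fixed = TRUE)[[1]]

# Each data row splits into 8 pieces although the header names only 7 columns
field_counts <- lengths(strsplit(lines[-1], ",", fixed = TRUE))

# Hypothetical helper: keep the first n - 1 fields, glue the remainder back together
split_row <- function(line, n = 7) {
  parts <- strsplit(line, ",", fixed = TRUE)[[1]]
  c(parts[seq_len(n - 1)], paste(parts[n:length(parts)], collapse = ","))
}

# vapply returns one column per row of data, so transpose to get rows x columns
rows <- t(vapply(lines[-1], split_row, character(7), USE.NAMES = FALSE))
sample_df <- as.data.frame(rows, stringsAsFactors = FALSE)
names(sample_df) <- header
sample_df$Price <- as.numeric(sample_df$Price)
```

This approach assumes that only the last column can contain commas. If any other field could contain one, quoting the fields in the source text would be the safer route.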

Step 3: Converting the Data to JSON

JSON (JavaScript Object Notation) is a popular format for exchanging data between systems, especially for web applications. It’s lightweight and easy for machines to parse. Let’s convert our data into JSON format:

library(jsonlite)
# Convert to JSON format
cunymart_json <- toJSON(cunymart_data, pretty = TRUE)
cat(cunymart_json)
## [
##   {
##     "Category": "Electronics",
##     "Item Name": "Smartphone",
##     "Item ID": 101,
##     "Brand": "TechBrand",
##     "Price": 699.99,
##     "Variation ID": "101-A",
##     "Variation Details": "Color: Black, Storage: 64GB"
##   },
##   {
##     "Category": "Electronics",
##     "Item Name": "Smartphone",
##     "Item ID": 101,
##     "Brand": "TechBrand",
##     "Price": 699.99,
##     "Variation ID": "101-B",
##     "Variation Details": "Color: White, Storage: 128GB"
##   },
##   {
##     "Category": "Electronics",
##     "Item Name": "Laptop",
##     "Item ID": 102,
##     "Brand": "CompuBrand",
##     "Price": 1099.99,
##     "Variation ID": "102-A",
##     "Variation Details": "Color: Silver, Storage: 256GB"
##   },
##   {
##     "Category": "Electronics",
##     "Item Name": "Laptop",
##     "Item ID": 102,
##     "Brand": "CompuBrand",
##     "Price": 1099.99,
##     "Variation ID": "102-B",
##     "Variation Details": "Color: Space Gray, Storage: 512GB"
##   },
##   {
##     "Category": "Home Appliances",
##     "Item Name": "Refrigerator",
##     "Item ID": 201,
##     "Brand": "HomeCool",
##     "Price": 899.99,
##     "Variation ID": "201-A",
##     "Variation Details": "Color: Stainless Steel, Capacity: 20 cu ft"
##   },
##   {
##     "Category": "Home Appliances",
##     "Item Name": "Refrigerator",
##     "Item ID": 201,
##     "Brand": "HomeCool",
##     "Price": 899.99,
##     "Variation ID": "201-B",
##     "Variation Details": "Color: White, Capacity: 18 cu ft"
##   },
##   {
##     "Category": "Home Appliances",
##     "Item Name": "Washing Machine",
##     "Item ID": 202,
##     "Brand": "CleanTech",
##     "Price": 499.99,
##     "Variation ID": "202-A",
##     "Variation Details": "Type: Front Load, Capacity: 4.5 cu ft"
##   },
##   {
##     "Category": "Home Appliances",
##     "Item Name": "Washing Machine",
##     "Item ID": 202,
##     "Brand": "CleanTech",
##     "Price": 499.99,
##     "Variation ID": "202-B",
##     "Variation Details": "Type: Top Load, Capacity: 5.0 cu ft"
##   },
##   {
##     "Category": "Clothing",
##     "Item Name": "T-Shirt",
##     "Item ID": 301,
##     "Brand": "FashionCo",
##     "Price": 19.99,
##     "Variation ID": "301-A",
##     "Variation Details": "Color: Blue, Size: S"
##   },
##   {
##     "Category": "Clothing",
##     "Item Name": "T-Shirt",
##     "Item ID": 301,
##     "Brand": "FashionCo",
##     "Price": 19.99,
##     "Variation ID": "301-B",
##     "Variation Details": "Color: Red, Size: M"
##   },
##   {
##     "Category": "Clothing",
##     "Item Name": "T-Shirt",
##     "Item ID": 301,
##     "Brand": "FashionCo",
##     "Price": 19.99,
##     "Variation ID": "301-C",
##     "Variation Details": "Color: Green, Size: L"
##   },
##   {
##     "Category": "Clothing",
##     "Item Name": "Jeans",
##     "Item ID": 302,
##     "Brand": "DenimWorks",
##     "Price": 49.99,
##     "Variation ID": "302-A",
##     "Variation Details": "Color: Dark Blue, Size: 32"
##   },
##   {
##     "Category": "Clothing",
##     "Item Name": "Jeans",
##     "Item ID": 302,
##     "Brand": "DenimWorks",
##     "Price": 49.99,
##     "Variation ID": "302-B",
##     "Variation Details": "Color: Light Blue, Size: 34"
##   },
##   {
##     "Category": "Books",
##     "Item Name": "Fiction Novel",
##     "Item ID": 401,
##     "Brand": "-",
##     "Price": 14.99,
##     "Variation ID": "401-A",
##     "Variation Details": "Format: Hardcover, Language: English"
##   },
##   {
##     "Category": "Books",
##     "Item Name": "Fiction Novel",
##     "Item ID": 401,
##     "Brand": "-",
##     "Price": 14.99,
##     "Variation ID": "401-B",
##     "Variation Details": "Format: Paperback, Language: Spanish"
##   },
##   {
##     "Category": "Books",
##     "Item Name": "Non-Fiction Guide",
##     "Item ID": 402,
##     "Brand": "-",
##     "Price": 24.99,
##     "Variation ID": "402-A",
##     "Variation Details": "Format: eBook, Language: English"
##   },
##   {
##     "Category": "Books",
##     "Item Name": "Non-Fiction Guide",
##     "Item ID": 402,
##     "Brand": "-",
##     "Price": 24.99,
##     "Variation ID": "402-B",
##     "Variation Details": "Format: Paperback, Language: French"
##   },
##   {
##     "Category": "Sports Equipment",
##     "Item Name": "Basketball",
##     "Item ID": 501,
##     "Brand": "SportsGear",
##     "Price": 29.99,
##     "Variation ID": "501-A",
##     "Variation Details": "Size: Size 7, Color: Orange"
##   },
##   {
##     "Category": "Sports Equipment",
##     "Item Name": "Tennis Racket",
##     "Item ID": 502,
##     "Brand": "RacketPro",
##     "Price": 89.99,
##     "Variation ID": "502-A",
##     "Variation Details": "Material: Graphite, Color: Black"
##   },
##   {
##     "Category": "Sports Equipment",
##     "Item Name": "Tennis Racket",
##     "Item ID": 502,
##     "Brand": "RacketPro",
##     "Price": 89.99,
##     "Variation ID": "502-B",
##     "Variation Details": "Material: Aluminum, Color: Silver"
##   }
## ]

Pros of JSON:

  1. Simple syntax that is easy for machines to parse and generate.
  2. Great for transmitting data between servers and browsers.

Cons of JSON:

  1. Less human-readable than a rendered table, especially for large datasets.
  2. Does not inherently enforce a schema, so data validation can be tricky.
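Because JSON carries no schema of its own, any type checking has to happen after parsing. Here is a small sketch of such a check using `jsonlite::fromJSON`; the two-record string `records_json` is made up for illustration, with one price deliberately stored as a string.

```r
library(jsonlite)

# Made-up sample: the second record stores Price as a string by mistake
records_json <- '[
  {"Item ID": 101, "Price": 699.99},
  {"Item ID": 102, "Price": "1099.99"}
]'

# simplifyVector = FALSE keeps each record as a list, so the types are not coerced
parsed <- fromJSON(records_json, simplifyVector = FALSE)

# Flag the records whose Price is a genuine number
price_is_number <- vapply(
  parsed,
  function(rec) is.numeric(rec[["Price"]]),
  logical(1)
)
```

Both records parse without complaint, and only the explicit check reveals that the second price is text. For stricter validation, a JSON Schema validator would be the usual next step, which is outside the scope of this report.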

Step 4: Converting the Data to HTML

HTML (HyperText Markup Language) is primarily used for displaying data in web browsers. Converting the dataset to HTML allows us to easily display it as a table on a webpage:

library(xtable)
# Convert to HTML format
cunymart_html <- print(xtable(cunymart_data), type = 'html')
## <!-- html table generated in R 4.3.3 by xtable 1.8-4 package -->
## <!-- Sun Oct 20 10:36:30 2024 -->
## <table border=1>
## <tr> <th>  </th> <th> Category </th> <th> Item Name </th> <th> Item ID </th> <th> Brand </th> <th> Price </th> <th> Variation ID </th> <th> Variation Details </th>  </tr>
##   <tr> <td align="right"> 1 </td> <td> Electronics </td> <td> Smartphone </td> <td align="right"> 101.00 </td> <td> TechBrand </td> <td align="right"> 699.99 </td> <td> 101-A </td> <td> Color: Black, Storage: 64GB </td> </tr>
##   <tr> <td align="right"> 2 </td> <td> Electronics </td> <td> Smartphone </td> <td align="right"> 101.00 </td> <td> TechBrand </td> <td align="right"> 699.99 </td> <td> 101-B </td> <td> Color: White, Storage: 128GB </td> </tr>
##   <tr> <td align="right"> 3 </td> <td> Electronics </td> <td> Laptop </td> <td align="right"> 102.00 </td> <td> CompuBrand </td> <td align="right"> 1099.99 </td> <td> 102-A </td> <td> Color: Silver, Storage: 256GB </td> </tr>
##   <tr> <td align="right"> 4 </td> <td> Electronics </td> <td> Laptop </td> <td align="right"> 102.00 </td> <td> CompuBrand </td> <td align="right"> 1099.99 </td> <td> 102-B </td> <td> Color: Space Gray, Storage: 512GB </td> </tr>
##   <tr> <td align="right"> 5 </td> <td> Home Appliances </td> <td> Refrigerator </td> <td align="right"> 201.00 </td> <td> HomeCool </td> <td align="right"> 899.99 </td> <td> 201-A </td> <td> Color: Stainless Steel, Capacity: 20 cu ft </td> </tr>
##   <tr> <td align="right"> 6 </td> <td> Home Appliances </td> <td> Refrigerator </td> <td align="right"> 201.00 </td> <td> HomeCool </td> <td align="right"> 899.99 </td> <td> 201-B </td> <td> Color: White, Capacity: 18 cu ft </td> </tr>
##   <tr> <td align="right"> 7 </td> <td> Home Appliances </td> <td> Washing Machine </td> <td align="right"> 202.00 </td> <td> CleanTech </td> <td align="right"> 499.99 </td> <td> 202-A </td> <td> Type: Front Load, Capacity: 4.5 cu ft </td> </tr>
##   <tr> <td align="right"> 8 </td> <td> Home Appliances </td> <td> Washing Machine </td> <td align="right"> 202.00 </td> <td> CleanTech </td> <td align="right"> 499.99 </td> <td> 202-B </td> <td> Type: Top Load, Capacity: 5.0 cu ft </td> </tr>
##   <tr> <td align="right"> 9 </td> <td> Clothing </td> <td> T-Shirt </td> <td align="right"> 301.00 </td> <td> FashionCo </td> <td align="right"> 19.99 </td> <td> 301-A </td> <td> Color: Blue, Size: S </td> </tr>
##   <tr> <td align="right"> 10 </td> <td> Clothing </td> <td> T-Shirt </td> <td align="right"> 301.00 </td> <td> FashionCo </td> <td align="right"> 19.99 </td> <td> 301-B </td> <td> Color: Red, Size: M </td> </tr>
##   <tr> <td align="right"> 11 </td> <td> Clothing </td> <td> T-Shirt </td> <td align="right"> 301.00 </td> <td> FashionCo </td> <td align="right"> 19.99 </td> <td> 301-C </td> <td> Color: Green, Size: L </td> </tr>
##   <tr> <td align="right"> 12 </td> <td> Clothing </td> <td> Jeans </td> <td align="right"> 302.00 </td> <td> DenimWorks </td> <td align="right"> 49.99 </td> <td> 302-A </td> <td> Color: Dark Blue, Size: 32 </td> </tr>
##   <tr> <td align="right"> 13 </td> <td> Clothing </td> <td> Jeans </td> <td align="right"> 302.00 </td> <td> DenimWorks </td> <td align="right"> 49.99 </td> <td> 302-B </td> <td> Color: Light Blue, Size: 34 </td> </tr>
##   <tr> <td align="right"> 14 </td> <td> Books </td> <td> Fiction Novel </td> <td align="right"> 401.00 </td> <td> - </td> <td align="right"> 14.99 </td> <td> 401-A </td> <td> Format: Hardcover, Language: English </td> </tr>
##   <tr> <td align="right"> 15 </td> <td> Books </td> <td> Fiction Novel </td> <td align="right"> 401.00 </td> <td> - </td> <td align="right"> 14.99 </td> <td> 401-B </td> <td> Format: Paperback, Language: Spanish </td> </tr>
##   <tr> <td align="right"> 16 </td> <td> Books </td> <td> Non-Fiction Guide </td> <td align="right"> 402.00 </td> <td> - </td> <td align="right"> 24.99 </td> <td> 402-A </td> <td> Format: eBook, Language: English </td> </tr>
##   <tr> <td align="right"> 17 </td> <td> Books </td> <td> Non-Fiction Guide </td> <td align="right"> 402.00 </td> <td> - </td> <td align="right"> 24.99 </td> <td> 402-B </td> <td> Format: Paperback, Language: French </td> </tr>
##   <tr> <td align="right"> 18 </td> <td> Sports Equipment </td> <td> Basketball </td> <td align="right"> 501.00 </td> <td> SportsGear </td> <td align="right"> 29.99 </td> <td> 501-A </td> <td> Size: Size 7, Color: Orange </td> </tr>
##   <tr> <td align="right"> 19 </td> <td> Sports Equipment </td> <td> Tennis Racket </td> <td align="right"> 502.00 </td> <td> RacketPro </td> <td align="right"> 89.99 </td> <td> 502-A </td> <td> Material: Graphite, Color: Black </td> </tr>
##   <tr> <td align="right"> 20 </td> <td> Sports Equipment </td> <td> Tennis Racket </td> <td align="right"> 502.00 </td> <td> RacketPro </td> <td align="right"> 89.99 </td> <td> 502-B </td> <td> Material: Aluminum, Color: Silver </td> </tr>
##    </table>
# `print()` already wrote the table above and returned the same HTML invisibly as a string,
# so calling `cat(cunymart_html)` here would only print the identical table a second time.

Pros of HTML:

  1. Perfect for displaying data on web pages.
  2. Easy for humans to read in a browser.

Cons of HTML:

  1. Not designed for storing large datasets.
  2. Bulky compared to other formats like JSON or Parquet.
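The HTML output above includes a row-number column and shows Item ID as `101.00`. As a hedged sketch of one way to tidy this, the example below uses the `include.rownames` and `print.results` arguments of `print.xtable` on a small made-up data frame (`sample_df` is illustrative). It also stores the ID as an integer, which xtable renders without decimals.

```r
library(xtable)

# Made-up sample; ItemID is an integer so it is not shown with decimals
sample_df <- data.frame(
  Item = c("Smartphone", "Laptop"),
  ItemID = c(101L, 102L),
  Price = c(699.99, 1099.99),
  stringsAsFactors = FALSE
)

# print.results = FALSE returns the HTML as a string without echoing it;
# include.rownames = FALSE drops the leading row-number column
html_string <- print(
  xtable(sample_df),
  type = "html",
  include.rownames = FALSE,
  print.results = FALSE
)
```

The resulting string can then be written to a file with `writeLines(html_string, "table.html")`, where the file name is only an example.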

Step 5: Converting the Data to XML

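XML (eXtensible Markup Language) is another format for structured data, often used for exchanging information between systems. It’s similar to JSON but is more verbose and has strict rules for structure. One of those rules is that element names cannot contain spaces, so tags built directly from column names such as Item Name, as in the output below, are not well-formed and a strict XML parser would reject them. The code below builds one item node per row: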

library(XML)
xml_doc <- newXMLDoc()
root <- newXMLNode("inventory", doc = xml_doc)
# Add each row as an <item> node, with one child node per column
suppressWarnings(
  for (i in seq_len(nrow(cunymart_data))) {
    item_node <- newXMLNode("item", parent = root)
    for (col in names(cunymart_data)) {
      newXMLNode(col, cunymart_data[i, col], parent = item_node)
    }
  }
)

# Save XML
cunymart_xml <- saveXML(xml_doc)
cat(cunymart_xml)
## <?xml version="1.0"?>
## <inventory>
##   <item>
##     <Category>Electronics</Category>
##     <Item Name>Smartphone</Item Name>
##     <Item ID>101</Item ID>
##     <Brand>TechBrand</Brand>
##     <Price>699.99</Price>
##     <Variation ID>101-A</Variation ID>
##     <Variation Details>Color: Black, Storage: 64GB</Variation Details>
##   </item>
##   <item>
##     <Category>Electronics</Category>
##     <Item Name>Smartphone</Item Name>
##     <Item ID>101</Item ID>
##     <Brand>TechBrand</Brand>
##     <Price>699.99</Price>
##     <Variation ID>101-B</Variation ID>
##     <Variation Details>Color: White, Storage: 128GB</Variation Details>
##   </item>
##   <item>
##     <Category>Electronics</Category>
##     <Item Name>Laptop</Item Name>
##     <Item ID>102</Item ID>
##     <Brand>CompuBrand</Brand>
##     <Price>1099.99</Price>
##     <Variation ID>102-A</Variation ID>
##     <Variation Details>Color: Silver, Storage: 256GB</Variation Details>
##   </item>
##   <item>
##     <Category>Electronics</Category>
##     <Item Name>Laptop</Item Name>
##     <Item ID>102</Item ID>
##     <Brand>CompuBrand</Brand>
##     <Price>1099.99</Price>
##     <Variation ID>102-B</Variation ID>
##     <Variation Details>Color: Space Gray, Storage: 512GB</Variation Details>
##   </item>
##   <item>
##     <Category>Home Appliances</Category>
##     <Item Name>Refrigerator</Item Name>
##     <Item ID>201</Item ID>
##     <Brand>HomeCool</Brand>
##     <Price>899.99</Price>
##     <Variation ID>201-A</Variation ID>
##     <Variation Details>Color: Stainless Steel, Capacity: 20 cu ft</Variation Details>
##   </item>
##   <item>
##     <Category>Home Appliances</Category>
##     <Item Name>Refrigerator</Item Name>
##     <Item ID>201</Item ID>
##     <Brand>HomeCool</Brand>
##     <Price>899.99</Price>
##     <Variation ID>201-B</Variation ID>
##     <Variation Details>Color: White, Capacity: 18 cu ft</Variation Details>
##   </item>
##   <item>
##     <Category>Home Appliances</Category>
##     <Item Name>Washing Machine</Item Name>
##     <Item ID>202</Item ID>
##     <Brand>CleanTech</Brand>
##     <Price>499.99</Price>
##     <Variation ID>202-A</Variation ID>
##     <Variation Details>Type: Front Load, Capacity: 4.5 cu ft</Variation Details>
##   </item>
##   <item>
##     <Category>Home Appliances</Category>
##     <Item Name>Washing Machine</Item Name>
##     <Item ID>202</Item ID>
##     <Brand>CleanTech</Brand>
##     <Price>499.99</Price>
##     <Variation ID>202-B</Variation ID>
##     <Variation Details>Type: Top Load, Capacity: 5.0 cu ft</Variation Details>
##   </item>
##   <item>
##     <Category>Clothing</Category>
##     <Item Name>T-Shirt</Item Name>
##     <Item ID>301</Item ID>
##     <Brand>FashionCo</Brand>
##     <Price>19.99</Price>
##     <Variation ID>301-A</Variation ID>
##     <Variation Details>Color: Blue, Size: S</Variation Details>
##   </item>
##   <item>
##     <Category>Clothing</Category>
##     <Item Name>T-Shirt</Item Name>
##     <Item ID>301</Item ID>
##     <Brand>FashionCo</Brand>
##     <Price>19.99</Price>
##     <Variation ID>301-B</Variation ID>
##     <Variation Details>Color: Red, Size: M</Variation Details>
##   </item>
##   <item>
##     <Category>Clothing</Category>
##     <Item Name>T-Shirt</Item Name>
##     <Item ID>301</Item ID>
##     <Brand>FashionCo</Brand>
##     <Price>19.99</Price>
##     <Variation ID>301-C</Variation ID>
##     <Variation Details>Color: Green, Size: L</Variation Details>
##   </item>
##   <item>
##     <Category>Clothing</Category>
##     <Item Name>Jeans</Item Name>
##     <Item ID>302</Item ID>
##     <Brand>DenimWorks</Brand>
##     <Price>49.99</Price>
##     <Variation ID>302-A</Variation ID>
##     <Variation Details>Color: Dark Blue, Size: 32</Variation Details>
##   </item>
##   <item>
##     <Category>Clothing</Category>
##     <Item Name>Jeans</Item Name>
##     <Item ID>302</Item ID>
##     <Brand>DenimWorks</Brand>
##     <Price>49.99</Price>
##     <Variation ID>302-B</Variation ID>
##     <Variation Details>Color: Light Blue, Size: 34</Variation Details>
##   </item>
##   <item>
##     <Category>Books</Category>
##     <Item Name>Fiction Novel</Item Name>
##     <Item ID>401</Item ID>
##     <Brand>-</Brand>
##     <Price>14.99</Price>
##     <Variation ID>401-A</Variation ID>
##     <Variation Details>Format: Hardcover, Language: English</Variation Details>
##   </item>
##   <item>
##     <Category>Books</Category>
##     <Item Name>Fiction Novel</Item Name>
##     <Item ID>401</Item ID>
##     <Brand>-</Brand>
##     <Price>14.99</Price>
##     <Variation ID>401-B</Variation ID>
##     <Variation Details>Format: Paperback, Language: Spanish</Variation Details>
##   </item>
##   <item>
##     <Category>Books</Category>
##     <Item Name>Non-Fiction Guide</Item Name>
##     <Item ID>402</Item ID>
##     <Brand>-</Brand>
##     <Price>24.99</Price>
##     <Variation ID>402-A</Variation ID>
##     <Variation Details>Format: eBook, Language: English</Variation Details>
##   </item>
##   <item>
##     <Category>Books</Category>
##     <Item Name>Non-Fiction Guide</Item Name>
##     <Item ID>402</Item ID>
##     <Brand>-</Brand>
##     <Price>24.99</Price>
##     <Variation ID>402-B</Variation ID>
##     <Variation Details>Format: Paperback, Language: French</Variation Details>
##   </item>
##   <item>
##     <Category>Sports Equipment</Category>
##     <Item Name>Basketball</Item Name>
##     <Item ID>501</Item ID>
##     <Brand>SportsGear</Brand>
##     <Price>29.99</Price>
##     <Variation ID>501-A</Variation ID>
##     <Variation Details>Size: Size 7, Color: Orange</Variation Details>
##   </item>
##   <item>
##     <Category>Sports Equipment</Category>
##     <Item Name>Tennis Racket</Item Name>
##     <Item ID>502</Item ID>
##     <Brand>RacketPro</Brand>
##     <Price>89.99</Price>
##     <Variation ID>502-A</Variation ID>
##     <Variation Details>Material: Graphite, Color: Black</Variation Details>
##   </item>
##   <item>
##     <Category>Sports Equipment</Category>
##     <Item Name>Tennis Racket</Item Name>
##     <Item ID>502</Item ID>
##     <Brand>RacketPro</Brand>
##     <Price>89.99</Price>
##     <Variation ID>502-B</Variation ID>
##     <Variation Details>Material: Aluminum, Color: Silver</Variation Details>
##   </item>
## </inventory>
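Tags such as `<Item Name>` in the output above contain a space, which XML does not allow in element names. A minimal sketch of one possible fix, assuming we are free to rename the tags, is to replace spaces with underscores before building the nodes. The sample data frame and the names `tag_names` and `reparsed` are illustrative only.

```r
library(XML)

# Made-up two-row sample with the same kind of column names
sample_df <- data.frame(
  "Item Name" = c("Smartphone", "Laptop"),
  "Item ID" = c(101, 102),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# XML element names cannot contain spaces, so swap them for underscores
tag_names <- gsub(" ", "_", names(sample_df), fixed = TRUE)

doc <- newXMLDoc()
root <- newXMLNode("inventory", doc = doc)
for (i in seq_len(nrow(sample_df))) {
  item_node <- newXMLNode("item", parent = root)
  for (j in seq_along(tag_names)) {
    # Convert each cell to text before adding it as the node content
    newXMLNode(tag_names[j], as.character(sample_df[[j]][i]), parent = item_node)
  }
}

xml_string <- saveXML(doc)

# Parsing the string back confirms that the document is well-formed
reparsed <- xmlParse(xml_string, asText = TRUE)
item_names <- xpathSApply(reparsed, "//item/Item_Name", xmlValue)
```

Parsing the result back with `xmlParse` is a quick way to check that other tools will be able to read the file.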

Pros of XML:

  1. Well-structured and highly standardized.
  2. Great for hierarchical data and data exchange.

Cons of XML:

  1. Very verbose, which makes files larger.
  2. Typically slower to parse than JSON.

Step 6: Converting the Data to Parquet

Parquet is a columnar storage format often used in big data processing. It’s optimized for storing and reading large datasets efficiently:

library(arrow)
## 
## Attaching package: 'arrow'
## The following object is masked from 'package:utils':
## 
##     timestamp
# Convert to Parquet format
write_parquet(cunymart_data, "cunymart_data.parquet")
# Reading the data back from the Parquet file
cunymart_parquet <- read_parquet("cunymart_data.parquet")
# Display the data
print(cunymart_parquet)
## # A tibble: 20 × 7
##    Category         `Item Name`       `Item ID` Brand       Price `Variation ID`
##  * <chr>            <chr>                 <dbl> <chr>       <dbl> <chr>         
##  1 Electronics      Smartphone              101 TechBrand   700.  101-A         
##  2 Electronics      Smartphone              101 TechBrand   700.  101-B         
##  3 Electronics      Laptop                  102 CompuBrand 1100.  102-A         
##  4 Electronics      Laptop                  102 CompuBrand 1100.  102-B         
##  5 Home Appliances  Refrigerator            201 HomeCool    900.  201-A         
##  6 Home Appliances  Refrigerator            201 HomeCool    900.  201-B         
##  7 Home Appliances  Washing Machine         202 CleanTech   500.  202-A         
##  8 Home Appliances  Washing Machine         202 CleanTech   500.  202-B         
##  9 Clothing         T-Shirt                 301 FashionCo    20.0 301-A         
## 10 Clothing         T-Shirt                 301 FashionCo    20.0 301-B         
## 11 Clothing         T-Shirt                 301 FashionCo    20.0 301-C         
## 12 Clothing         Jeans                   302 DenimWorks   50.0 302-A         
## 13 Clothing         Jeans                   302 DenimWorks   50.0 302-B         
## 14 Books            Fiction Novel           401 -            15.0 401-A         
## 15 Books            Fiction Novel           401 -            15.0 401-B         
## 16 Books            Non-Fiction Guide       402 -            25.0 402-A         
## 17 Books            Non-Fiction Guide       402 -            25.0 402-B         
## 18 Sports Equipment Basketball              501 SportsGear   30.0 501-A         
## 19 Sports Equipment Tennis Racket           502 RacketPro    90.0 502-A         
## 20 Sports Equipment Tennis Racket           502 RacketPro    90.0 502-B         
## # ℹ 1 more variable: `Variation Details` <chr>

Pros of Parquet:

  1. Columnar layout and built-in compression make it very efficient for storing large datasets.
  2. Typically faster to read than row-based text formats, particularly when only some columns are needed.

Cons of Parquet:

  1. Requires a dedicated library (such as arrow in R) to read and write.
  2. Binary format, so not human-readable the way JSON or XML are.
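One practical benefit of the columnar layout is that we can read back only the columns we need. The sketch below illustrates this with the `col_select` argument of `arrow::read_parquet`. It uses a made-up three-row data frame and a temporary file, so `sample_df`, `path`, and `prices_only` are illustrative names.

```r
library(arrow)

# Made-up sample with three columns
sample_df <- data.frame(
  Category = c("Electronics", "Clothing", "Books"),
  Brand = c("TechBrand", "FashionCo", "-"),
  Price = c(699.99, 19.99, 14.99),
  stringsAsFactors = FALSE
)

# Write to a temporary file so that nothing is left in the working directory
path <- tempfile(fileext = ".parquet")
write_parquet(sample_df, path)

# Read back only two of the three columns; the Brand column is not loaded
prices_only <- read_parquet(path, col_select = c("Category", "Price"))
```

On a table this small the saving is negligible, but on wide datasets skipping unused columns can noticeably reduce read time and memory use.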

Conclusion

In this report, we took the CUNYMart inventory dataset, imported it from text, and converted it into several useful formats: JSON, HTML, XML, and Parquet. Each format serves a specific purpose, depending on whether the data is for storage, machine-to-machine communication, or display on a webpage. JSON is lightweight and convenient for exchanging data between systems, Parquet is efficient for storing and reading large datasets, HTML is best for human reading in a browser, and XML offers a strict structure suited to data exchange.