Overview

Delimited text retail inventory data was provided to prepare for analysis by formatting it in JSON, HTML, XML, and Parquet. An overview of the pros and cons of each of these file formats is also included.


Generate Data

Create string of all data provided without the first header row

txt_data = '
Electronics,Smartphone,101,TechBrand,699.99,101-A,Color: Black, Storage: 64GB
Electronics,Smartphone,101,TechBrand,699.99,101-B,Color: White, Storage: 128GB
Electronics,Laptop,102,CompuBrand,1099.99,102-A,Color: Silver, Storage: 256GB
Electronics,Laptop,102,CompuBrand,1099.99,102-B,Color: Space Gray, Storage: 512GB
Home Appliances,Refrigerator,201,HomeCool,899.99,201-A,Color: Stainless Steel, Capacity:20 cu ft
Home Appliances,Refrigerator,201,HomeCool,899.99,201-B,Color: White, Capacity: 18 cu ft
Home Appliances,Washing Machine,202,CleanTech,499.99,202-A,Type: Front Load, Capacity:4.5 cu ft
Home Appliances,Washing Machine,202,CleanTech,499.99,202-B,Type: Top Load, Capacity:5.0 cu ft
Clothing,T-Shirt,301,FashionCo,19.99,301-A,Color: Blue, Size: S
Clothing,T-Shirt,301,FashionCo,19.99,301-B,Color: Red, Size: M
Clothing,T-Shirt,301,FashionCo,19.99,301-C,Color: Green, Size: L
Clothing,Jeans,302,DenimWorks,49.99,302-A,Color: Dark Blue, Size: 32
Clothing,Jeans,302,DenimWorks,49.99,302-B,Color: Light Blue, Size: 34
Books,Fiction Novel,401,-,14.99,401-A,Format: Hardcover, Language: English
Books,Fiction Novel,401,-,14.99,401-B,Format: Paperback, Language: Spanish
Books,Non-Fiction Guide,402,-,24.99,402-A,Format: eBook, Language: English
Books,Non-Fiction Guide,402,-,24.99,402-B,Format: Paperback, Language: French
Sports Equipment,Basketball,501,SportsGear,29.99,501-A,Size: Size 7, Color: Orange
Sports Equipment,Tennis Racket,502,RacketPro,89.99,502-A,Material: Graphite, Color: Black
Sports Equipment,Tennis Racket,502,RacketPro,89.99,502-B,Material: Aluminum, Color: Silver'

Input text data into data frame, define column names, and merge the last two Variation_Details columns since this column was incorrectly separated when transformed into a data table as result of this column including a comma

df = read.table(
  text = txt_data,
  sep = ',',
  header = FALSE,
  col.names = c(
    'Category', 'Item_Name', 'Item ID', 'Brand',
    'Price', 'Variation_ID', 'Variation_Details_1',
    'Variation_Details_2'))

df = df |>
  unite('Variation_Details', 'Variation_Details_1', 'Variation_Details_2', sep = ", ")


JSON

Prepare and output JSON file from data frame

json_data = toJSON(df, pretty=T)
write(json_data, file = "DATA607_Week 7 Assignment_JSON.json")
json_data
## [
##   {
##     "Category": "Electronics",
##     "Item_Name": "Smartphone",
##     "Item.ID": 101,
##     "Brand": "TechBrand",
##     "Price": 699.99,
##     "Variation_ID": "101-A",
##     "Variation_Details": "Color: Black,  Storage: 64GB"
##   },
##   {
##     "Category": "Electronics",
##     "Item_Name": "Smartphone",
##     "Item.ID": 101,
##     "Brand": "TechBrand",
##     "Price": 699.99,
##     "Variation_ID": "101-B",
##     "Variation_Details": "Color: White,  Storage: 128GB"
##   },
##   {
##     "Category": "Electronics",
##     "Item_Name": "Laptop",
##     "Item.ID": 102,
##     "Brand": "CompuBrand",
##     "Price": 1099.99,
##     "Variation_ID": "102-A",
##     "Variation_Details": "Color: Silver,  Storage: 256GB"
##   },
##   {
##     "Category": "Electronics",
##     "Item_Name": "Laptop",
##     "Item.ID": 102,
##     "Brand": "CompuBrand",
##     "Price": 1099.99,
##     "Variation_ID": "102-B",
##     "Variation_Details": "Color: Space Gray,  Storage: 512GB"
##   },
##   {
##     "Category": "Home Appliances",
##     "Item_Name": "Refrigerator",
##     "Item.ID": 201,
##     "Brand": "HomeCool",
##     "Price": 899.99,
##     "Variation_ID": "201-A",
##     "Variation_Details": "Color: Stainless Steel,  Capacity:20 cu ft"
##   },
##   {
##     "Category": "Home Appliances",
##     "Item_Name": "Refrigerator",
##     "Item.ID": 201,
##     "Brand": "HomeCool",
##     "Price": 899.99,
##     "Variation_ID": "201-B",
##     "Variation_Details": "Color: White,  Capacity: 18 cu ft"
##   },
##   {
##     "Category": "Home Appliances",
##     "Item_Name": "Washing Machine",
##     "Item.ID": 202,
##     "Brand": "CleanTech",
##     "Price": 499.99,
##     "Variation_ID": "202-A",
##     "Variation_Details": "Type: Front Load,  Capacity:4.5 cu ft"
##   },
##   {
##     "Category": "Home Appliances",
##     "Item_Name": "Washing Machine",
##     "Item.ID": 202,
##     "Brand": "CleanTech",
##     "Price": 499.99,
##     "Variation_ID": "202-B",
##     "Variation_Details": "Type: Top Load,  Capacity:5.0 cu ft"
##   },
##   {
##     "Category": "Clothing",
##     "Item_Name": "T-Shirt",
##     "Item.ID": 301,
##     "Brand": "FashionCo",
##     "Price": 19.99,
##     "Variation_ID": "301-A",
##     "Variation_Details": "Color: Blue,  Size: S"
##   },
##   {
##     "Category": "Clothing",
##     "Item_Name": "T-Shirt",
##     "Item.ID": 301,
##     "Brand": "FashionCo",
##     "Price": 19.99,
##     "Variation_ID": "301-B",
##     "Variation_Details": "Color: Red,  Size: M"
##   },
##   {
##     "Category": "Clothing",
##     "Item_Name": "T-Shirt",
##     "Item.ID": 301,
##     "Brand": "FashionCo",
##     "Price": 19.99,
##     "Variation_ID": "301-C",
##     "Variation_Details": "Color: Green,  Size: L"
##   },
##   {
##     "Category": "Clothing",
##     "Item_Name": "Jeans",
##     "Item.ID": 302,
##     "Brand": "DenimWorks",
##     "Price": 49.99,
##     "Variation_ID": "302-A",
##     "Variation_Details": "Color: Dark Blue,  Size: 32"
##   },
##   {
##     "Category": "Clothing",
##     "Item_Name": "Jeans",
##     "Item.ID": 302,
##     "Brand": "DenimWorks",
##     "Price": 49.99,
##     "Variation_ID": "302-B",
##     "Variation_Details": "Color: Light Blue,  Size: 34"
##   },
##   {
##     "Category": "Books",
##     "Item_Name": "Fiction Novel",
##     "Item.ID": 401,
##     "Brand": "-",
##     "Price": 14.99,
##     "Variation_ID": "401-A",
##     "Variation_Details": "Format: Hardcover,  Language: English"
##   },
##   {
##     "Category": "Books",
##     "Item_Name": "Fiction Novel",
##     "Item.ID": 401,
##     "Brand": "-",
##     "Price": 14.99,
##     "Variation_ID": "401-B",
##     "Variation_Details": "Format: Paperback,  Language: Spanish"
##   },
##   {
##     "Category": "Books",
##     "Item_Name": "Non-Fiction Guide",
##     "Item.ID": 402,
##     "Brand": "-",
##     "Price": 24.99,
##     "Variation_ID": "402-A",
##     "Variation_Details": "Format: eBook,  Language: English"
##   },
##   {
##     "Category": "Books",
##     "Item_Name": "Non-Fiction Guide",
##     "Item.ID": 402,
##     "Brand": "-",
##     "Price": 24.99,
##     "Variation_ID": "402-B",
##     "Variation_Details": "Format: Paperback,  Language: French"
##   },
##   {
##     "Category": "Sports Equipment",
##     "Item_Name": "Basketball",
##     "Item.ID": 501,
##     "Brand": "SportsGear",
##     "Price": 29.99,
##     "Variation_ID": "501-A",
##     "Variation_Details": "Size: Size 7,  Color: Orange"
##   },
##   {
##     "Category": "Sports Equipment",
##     "Item_Name": "Tennis Racket",
##     "Item.ID": 502,
##     "Brand": "RacketPro",
##     "Price": 89.99,
##     "Variation_ID": "502-A",
##     "Variation_Details": "Material: Graphite,  Color: Black"
##   },
##   {
##     "Category": "Sports Equipment",
##     "Item_Name": "Tennis Racket",
##     "Item.ID": 502,
##     "Brand": "RacketPro",
##     "Price": 89.99,
##     "Variation_ID": "502-B",
##     "Variation_Details": "Material: Aluminum,  Color: Silver"
##   }
## ]


XML

Prepare and output XML file from data frame

df_to_xml(
  df, fields="tags",
  record.tag = "Item", root.node='Inventory',
  xml.file = "DATA607_Week 7 Assignment_XML.xml")
## {xml_document}
## <Inventory encoding="UTF-8">
##  [1] <Item>\n  <Category>Electronics</Category>\n  <Item_Name>Smartphone</Ite ...
##  [2] <Item>\n  <Category>Electronics</Category>\n  <Item_Name>Smartphone</Ite ...
##  [3] <Item>\n  <Category>Electronics</Category>\n  <Item_Name>Laptop</Item_Na ...
##  [4] <Item>\n  <Category>Electronics</Category>\n  <Item_Name>Laptop</Item_Na ...
##  [5] <Item>\n  <Category>Home Appliances</Category>\n  <Item_Name>Refrigerato ...
##  [6] <Item>\n  <Category>Home Appliances</Category>\n  <Item_Name>Refrigerato ...
##  [7] <Item>\n  <Category>Home Appliances</Category>\n  <Item_Name>Washing Mac ...
##  [8] <Item>\n  <Category>Home Appliances</Category>\n  <Item_Name>Washing Mac ...
##  [9] <Item>\n  <Category>Clothing</Category>\n  <Item_Name>T-Shirt</Item_Name ...
## [10] <Item>\n  <Category>Clothing</Category>\n  <Item_Name>T-Shirt</Item_Name ...
## [11] <Item>\n  <Category>Clothing</Category>\n  <Item_Name>T-Shirt</Item_Name ...
## [12] <Item>\n  <Category>Clothing</Category>\n  <Item_Name>Jeans</Item_Name>\ ...
## [13] <Item>\n  <Category>Clothing</Category>\n  <Item_Name>Jeans</Item_Name>\ ...
## [14] <Item>\n  <Category>Books</Category>\n  <Item_Name>Fiction Novel</Item_N ...
## [15] <Item>\n  <Category>Books</Category>\n  <Item_Name>Fiction Novel</Item_N ...
## [16] <Item>\n  <Category>Books</Category>\n  <Item_Name>Non-Fiction Guide</It ...
## [17] <Item>\n  <Category>Books</Category>\n  <Item_Name>Non-Fiction Guide</It ...
## [18] <Item>\n  <Category>Sports Equipment</Category>\n  <Item_Name>Basketball ...
## [19] <Item>\n  <Category>Sports Equipment</Category>\n  <Item_Name>Tennis Rac ...
## [20] <Item>\n  <Category>Sports Equipment</Category>\n  <Item_Name>Tennis Rac ...


HTML

Prepare and output HTML table from data frame

html_data = df |>
  tableHTML(
    rownames=F,
    border=1,
    widths = rep(150, ncol(df)),    ) |>
  add_css_header(css = list(
    c('background-color', 'color'),
    c('#2d3c52', 'white')
    ), headers = 1:ncol(df)) |>
  add_css_row(css = list(
    c('font-family', 'font-size'),
    c('arial', '14px'))) 
write(html_data, file = "DATA607_Week 7 Assignment_HTML.html")
html_data
Category Item_Name Item.ID Brand Price Variation_ID Variation_Details
Electronics Smartphone 101 TechBrand 699.99 101-A Color: Black, Storage: 64GB
Electronics Smartphone 101 TechBrand 699.99 101-B Color: White, Storage: 128GB
Electronics Laptop 102 CompuBrand 1099.99 102-A Color: Silver, Storage: 256GB
Electronics Laptop 102 CompuBrand 1099.99 102-B Color: Space Gray, Storage: 512GB
Home Appliances Refrigerator 201 HomeCool 899.99 201-A Color: Stainless Steel, Capacity:20 cu ft
Home Appliances Refrigerator 201 HomeCool 899.99 201-B Color: White, Capacity: 18 cu ft
Home Appliances Washing Machine 202 CleanTech 499.99 202-A Type: Front Load, Capacity:4.5 cu ft
Home Appliances Washing Machine 202 CleanTech 499.99 202-B Type: Top Load, Capacity:5.0 cu ft
Clothing T-Shirt 301 FashionCo 19.99 301-A Color: Blue, Size: S
Clothing T-Shirt 301 FashionCo 19.99 301-B Color: Red, Size: M
Clothing T-Shirt 301 FashionCo 19.99 301-C Color: Green, Size: L
Clothing Jeans 302 DenimWorks 49.99 302-A Color: Dark Blue, Size: 32
Clothing Jeans 302 DenimWorks 49.99 302-B Color: Light Blue, Size: 34
Books Fiction Novel 401 - 14.99 401-A Format: Hardcover, Language: English
Books Fiction Novel 401 - 14.99 401-B Format: Paperback, Language: Spanish
Books Non-Fiction Guide 402 - 24.99 402-A Format: eBook, Language: English
Books Non-Fiction Guide 402 - 24.99 402-B Format: Paperback, Language: French
Sports Equipment Basketball 501 SportsGear 29.99 501-A Size: Size 7, Color: Orange
Sports Equipment Tennis Racket 502 RacketPro 89.99 502-A Material: Graphite, Color: Black
Sports Equipment Tennis Racket 502 RacketPro 89.99 502-B Material: Aluminum, Color: Silver


Parquet

Prepare and output parquet file from data frame

write_parquet(df, sink = "DATA607_Week 7 Assignment_PARQUET.parquet")
parquet_data = read_parquet("DATA607_Week 7 Assignment_PARQUET.parquet")
parquet_data
## # A tibble: 20 × 7
##    Category        Item_Name Item.ID Brand  Price Variation_ID Variation_Details
##    <chr>           <chr>       <int> <chr>  <dbl> <chr>        <chr>            
##  1 Electronics     Smartpho…     101 Tech…  700.  101-A        Color: Black,  S…
##  2 Electronics     Smartpho…     101 Tech…  700.  101-B        Color: White,  S…
##  3 Electronics     Laptop        102 Comp… 1100.  102-A        Color: Silver,  …
##  4 Electronics     Laptop        102 Comp… 1100.  102-B        Color: Space Gra…
##  5 Home Appliances Refriger…     201 Home…  900.  201-A        Color: Stainless…
##  6 Home Appliances Refriger…     201 Home…  900.  201-B        Color: White,  C…
##  7 Home Appliances Washing …     202 Clea…  500.  202-A        Type: Front Load…
##  8 Home Appliances Washing …     202 Clea…  500.  202-B        Type: Top Load, …
##  9 Clothing        T-Shirt       301 Fash…   20.0 301-A        Color: Blue,  Si…
## 10 Clothing        T-Shirt       301 Fash…   20.0 301-B        Color: Red,  Siz…
## 11 Clothing        T-Shirt       301 Fash…   20.0 301-C        Color: Green,  S…
## 12 Clothing        Jeans         302 Deni…   50.0 302-A        Color: Dark Blue…
## 13 Clothing        Jeans         302 Deni…   50.0 302-B        Color: Light Blu…
## 14 Books           Fiction …     401 -       15.0 401-A        Format: Hardcove…
## 15 Books           Fiction …     401 -       15.0 401-B        Format: Paperbac…
## 16 Books           Non-Fict…     402 -       25.0 402-A        Format: eBook,  …
## 17 Books           Non-Fict…     402 -       25.0 402-B        Format: Paperbac…
## 18 Sports Equipme… Basketba…     501 Spor…   30.0 501-A        Size: Size 7,  C…
## 19 Sports Equipme… Tennis R…     502 Rack…   90.0 502-A        Material: Graphi…
## 20 Sports Equipme… Tennis R…     502 Rack…   90.0 502-B        Material: Alumin…

Conclusions

File_Format PROS CONS
Parquet
  • Efficent encoding allows for smaller file sizes and faster files
  • Stores record data types
  • Files are “column-oriented” and “chunked” which typically leads to better performance and more efficient analysis
  • Not “human readable”
  • Not all programming languages or applications may support it
JSON
  • Widely compatible and adopted across web technologies and APIs
  • Ideal for representing simple hierarchical data structures
  • No native way to represent dates or date-times
  • Impercise rules for representing floating point numbers
  • less optimal and readable for large/complex datasets
XML
  • Text-based and human readable
  • Suitable for detailed datasets due to hierarchical data structure
  • Consistent format for data exchange
    • Less efficient for large datasets compared
    • Doesn’t have built-in support for numbers, dates, or other complex data types
    • Can be slower to parse and process, especially for large datasets
    • Harder to understand and parse with heavily nested architectures
    HTML
    • Standard language for most web pages
    • Universally supported across browsers and tools
    • Inefficient for structured data handling or data transfer purposes
    • Relies on CSS and JavaScript for the final presentation