We have received the following data from CUNYMart, located at 123 Example Street, Anytown, USA.
Ask: “This data will be used for inventory analysis at the retailer. You are required to prepare the data for analysis by formatting it in JSON, HTML, XML, and Parquet. Additionally, provide the pros and cons of each format.”
Let’s create a dataframe from the provided data:
library(tibble)
electronics_df <- tibble::tribble(
~Category, ~Item_Name, ~Item_ID, ~Brand, ~Price, ~Series_ID, ~Series_Details,
"Electronics", "Smartphone", 101, "TechBrand", 699.99, "101-A", "Color: Black, Storage: 64GB",
"Electronics", "Smartphone", 101, "TechBrand", 699.99, "101-B", "Color: White, Storage: 128GB",
"Electronics", "Laptop", 102, "CompuBrand", 1099.99, "102-A", "Color: Silver, Storage: 256GB",
"Electronics", "Laptop", 102, "CompuBrand", 1099.99, "102-B", "Color: Space Gray, Storage: 512GB",
"Home Appliances", "Refrigerator", 201, "HomeCool", 899.99, "201-A", "Color: Stainless Steel, Capacity: 20 cu ft",
"Home Appliances", "Refrigerator", 201, "HomeCool", 899.99, "201-B", "Color: White, Capacity: 18 cu ft",
"Home Appliances", "Washing Machine", 202, "CleanTech", 499.99, "202-A", "Type: Front Load, Capacity: 4.5 cu ft",
"Home Appliances", "Washing Machine", 202, "CleanTech", 499.99, "202-B", "Type: Top Load, Capacity: 5.0 cu ft",
"Clothing", "T-Shirt", 301, "FashionCo", 19.99, "301-A", "Color: Blue, Size: S",
"Clothing", "T-Shirt", 301, "FashionCo", 19.99, "301-B", "Color: Red, Size: M",
"Clothing", "T-Shirt", 301, "FashionCo", 19.99, "301-C", "Color: Green, Size: L",
"Clothing", "Jeans", 302, "DenimWorks", 49.99, "302-A", "Color: Dark Blue, Size: 32",
"Clothing", "Jeans", 302, "DenimWorks", 49.99, "302-B", "Color: Light Blue, Size: 34",
"Books", "Fiction Novel", 401, "-", 14.99, "401-A", "Format: Hardcover, Language: English",
"Books", "Fiction Novel", 401, "-", 14.99, "401-B", "Format: Paperback, Language: Spanish",
"Books", "Non-Fiction Guide", 402, "-", 24.99, "402-A", "Format: eBook, Language: English",
"Books", "Non-Fiction Guide", 402, "-", 24.99, "402-B", "Format: Paperback, Language: French",
"Sports Equipment", "Basketball", 501, "SportsGear", 29.99, "501-A", "Size: Size 7, Color: Orange",
"Sports Equipment", "Tennis Racket", 502, "RacketPro", 89.99, "502-A", "Material: Graphite, Color: Black",
"Sports Equipment", "Tennis Racket", 502, "RacketPro", 89.99, "502-B", "Material: Aluminum, Color: Silver"
)
print(electronics_df)
## # A tibble: 20 × 7
## Category Item_Name Item_ID Brand Price Series_ID Series_Details
## <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
## 1 Electronics Smartphone 101 Tech… 700. 101-A Color: Black,…
## 2 Electronics Smartphone 101 Tech… 700. 101-B Color: White,…
## 3 Electronics Laptop 102 Comp… 1100. 102-A Color: Silver…
## 4 Electronics Laptop 102 Comp… 1100. 102-B Color: Space …
## 5 Home Appliances Refrigerator 201 Home… 900. 201-A Color: Stainl…
## 6 Home Appliances Refrigerator 201 Home… 900. 201-B Color: White,…
## 7 Home Appliances Washing Machi… 202 Clea… 500. 202-A Type: Front L…
## 8 Home Appliances Washing Machi… 202 Clea… 500. 202-B Type: Top Loa…
## 9 Clothing T-Shirt 301 Fash… 20.0 301-A Color: Blue, …
## 10 Clothing T-Shirt 301 Fash… 20.0 301-B Color: Red, S…
## 11 Clothing T-Shirt 301 Fash… 20.0 301-C Color: Green,…
## 12 Clothing Jeans 302 Deni… 50.0 302-A Color: Dark B…
## 13 Clothing Jeans 302 Deni… 50.0 302-B Color: Light …
## 14 Books Fiction Novel 401 - 15.0 401-A Format: Hardc…
## 15 Books Fiction Novel 401 - 15.0 401-B Format: Paper…
## 16 Books Non-Fiction G… 402 - 25.0 402-A Format: eBook…
## 17 Books Non-Fiction G… 402 - 25.0 402-B Format: Paper…
## 18 Sports Equipment Basketball 501 Spor… 30.0 501-A Size: Size 7,…
## 19 Sports Equipment Tennis Racket 502 Rack… 90.0 502-A Material: Gra…
## 20 Sports Equipment Tennis Racket 502 Rack… 90.0 502-B Material: Alu…
Generate a JSON file from this dataset.
library(jsonlite)
json_data <- toJSON(electronics_df, pretty = TRUE)
write(json_data, file = "electronics.json")
Save the JSON file in github. GitHub Link: https://raw.githubusercontent.com/Yedzinovich/Data-607/refs/heads/main/electronics.json
####1 - JSON Let’s load our generated JSON file into a dataframe named electronics_json.
library(jsonlite)
electronics_json <- fromJSON("https://raw.githubusercontent.com/Yedzinovich/Data-607/refs/heads/main/electronics.json")
print(electronics_json)
## Category Item_Name Item_ID Brand Price Series_ID
## 1 Electronics Smartphone 101 TechBrand 699.99 101-A
## 2 Electronics Smartphone 101 TechBrand 699.99 101-B
## 3 Electronics Laptop 102 CompuBrand 1099.99 102-A
## 4 Electronics Laptop 102 CompuBrand 1099.99 102-B
## 5 Home Appliances Refrigerator 201 HomeCool 899.99 201-A
## 6 Home Appliances Refrigerator 201 HomeCool 899.99 201-B
## 7 Home Appliances Washing Machine 202 CleanTech 499.99 202-A
## 8 Home Appliances Washing Machine 202 CleanTech 499.99 202-B
## 9 Clothing T-Shirt 301 FashionCo 19.99 301-A
## 10 Clothing T-Shirt 301 FashionCo 19.99 301-B
## 11 Clothing T-Shirt 301 FashionCo 19.99 301-C
## 12 Clothing Jeans 302 DenimWorks 49.99 302-A
## 13 Clothing Jeans 302 DenimWorks 49.99 302-B
## 14 Books Fiction Novel 401 - 14.99 401-A
## 15 Books Fiction Novel 401 - 14.99 401-B
## 16 Books Non-Fiction Guide 402 - 24.99 402-A
## 17 Books Non-Fiction Guide 402 - 24.99 402-B
## 18 Sports Equipment Basketball 501 SportsGear 29.99 501-A
## 19 Sports Equipment Tennis Racket 502 RacketPro 89.99 502-A
## 20 Sports Equipment Tennis Racket 502 RacketPro 89.99 502-B
## Series_Details
## 1 Color: Black, Storage: 64GB
## 2 Color: White, Storage: 128GB
## 3 Color: Silver, Storage: 256GB
## 4 Color: Space Gray, Storage: 512GB
## 5 Color: Stainless Steel, Capacity: 20 cu ft
## 6 Color: White, Capacity: 18 cu ft
## 7 Type: Front Load, Capacity: 4.5 cu ft
## 8 Type: Top Load, Capacity: 5.0 cu ft
## 9 Color: Blue, Size: S
## 10 Color: Red, Size: M
## 11 Color: Green, Size: L
## 12 Color: Dark Blue, Size: 32
## 13 Color: Light Blue, Size: 34
## 14 Format: Hardcover, Language: English
## 15 Format: Paperback, Language: Spanish
## 16 Format: eBook, Language: English
## 17 Format: Paperback, Language: French
## 18 Size: Size 7, Color: Orange
## 19 Material: Graphite, Color: Black
## 20 Material: Aluminum, Color: Silver
####2 - HTML I will use a JSON file and a format converter to transform it into HTML format. Let’s then read our generated HTML file into a dataframe named electronics_html.
library(rvest)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
url <- "https://raw.githubusercontent.com/Yedzinovich/Data-607/refs/heads/main/electronics_html.html"
html_content <- read_html(url)
electronics_html <- html_content %>%
html_node("table") %>%
html_table(fill = TRUE)
print(electronics_html)
## # A tibble: 21 × 7
## Category Item_Name Item_ID Brand Price Series_ID Series_Details
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Electronics Smartphone 101 TechB… 699.… 101-A Color: Black,…
## 2 Electronics Smartphone 101 TechB… 699.… 101-B Color: White,…
## 3 Electronics Laptop 102 Compu… 1099… 102-A Color: Silver…
## 4 Electronics Laptop 102 Compu… 1099… 102-B Color: Space …
## 5 Home Appliances Refrigerator 201 HomeC… 899.… 201-A Color: Stainl…
## 6 Home Appliances Refrigerator 201 HomeC… 899.… 201-B Color: White,…
## 7 Home Appliances Washing Machine 202 Clean… 499.… 202-A Type: Front L…
## 8 Home Appliances Washing Machine 202 Clean… 499.… 202-B Type: Top Loa…
## 9 Clothing T-Shirt 301 Fashi… 19.99 301-A Color: Blue, …
## 10 Clothing T-Shirt 301 Fashi… 19.99 301-B Color: Red, S…
## # ℹ 11 more rows
I will use a JSON file and a format converter to transform it into XML format. Let’s then read our generated XML file into a dataframe named electronics_xml.
library(xml2)
url <- "https://raw.githubusercontent.com/Yedzinovich/Data-607/refs/heads/main/electronics_xml.xml"
xml_content <- read_xml(url)
items <- xml_find_all(xml_content, "//item")
print(items)
## {xml_nodeset (20)}
## [1] <item>\n <Category>Electronics</Category>\n <Item_Name>Smartphone</Ite ...
## [2] <item>\n <Category>Electronics</Category>\n <Item_Name>Smartphone</Ite ...
## [3] <item>\n <Category>Electronics</Category>\n <Item_Name>Laptop</Item_Na ...
## [4] <item>\n <Category>Electronics</Category>\n <Item_Name>Laptop</Item_Na ...
## [5] <item>\n <Category>Home Appliances</Category>\n <Item_Name>Refrigerato ...
## [6] <item>\n <Category>Home Appliances</Category>\n <Item_Name>Refrigerato ...
## [7] <item>\n <Category>Home Appliances</Category>\n <Item_Name>Washing Mac ...
## [8] <item>\n <Category>Home Appliances</Category>\n <Item_Name>Washing Mac ...
## [9] <item>\n <Category>Clothing</Category>\n <Item_Name>T-Shirt</Item_Name ...
## [10] <item>\n <Category>Clothing</Category>\n <Item_Name>T-Shirt</Item_Name ...
## [11] <item>\n <Category>Clothing</Category>\n <Item_Name>T-Shirt</Item_Name ...
## [12] <item>\n <Category>Clothing</Category>\n <Item_Name>Jeans</Item_Name>\ ...
## [13] <item>\n <Category>Clothing</Category>\n <Item_Name>Jeans</Item_Name>\ ...
## [14] <item>\n <Category>Books</Category>\n <Item_Name>Fiction Novel</Item_N ...
## [15] <item>\n <Category>Books</Category>\n <Item_Name>Fiction Novel</Item_N ...
## [16] <item>\n <Category>Books</Category>\n <Item_Name>Non-Fiction Guide</It ...
## [17] <item>\n <Category>Books</Category>\n <Item_Name>Non-Fiction Guide</It ...
## [18] <item>\n <Category>Sports Equipment</Category>\n <Item_Name>Basketball ...
## [19] <item>\n <Category>Sports Equipment</Category>\n <Item_Name>Tennis Rac ...
## [20] <item>\n <Category>Sports Equipment</Category>\n <Item_Name>Tennis Rac ...
categories <- xml_text(xml_find_all(items, "Category"))
item_names <- xml_text(xml_find_all(items, "Item_Name"))
item_ids <- xml_text(xml_find_all(items, "Item_ID"))
brands <- xml_text(xml_find_all(items, "Brand"))
prices <- xml_text(xml_find_all(items, "Price"))
series_ids <- xml_text(xml_find_all(items, "Series_ID"))
series_details <- xml_text(xml_find_all(items, "Series_Details"))
electronics_xml <- data.frame(
Category = categories,
Item_Name = item_names,
Item_ID = item_ids,
Brand = brands,
Price = prices,
Series_ID = series_ids,
Series_Details = series_details,
stringsAsFactors = FALSE
)
print(electronics_xml)
## Category Item_Name Item_ID Brand Price Series_ID
## 1 Electronics Smartphone 101 TechBrand 699.99 101-A
## 2 Electronics Smartphone 101 TechBrand 699.99 101-B
## 3 Electronics Laptop 102 CompuBrand 1099.99 102-A
## 4 Electronics Laptop 102 CompuBrand 1099.99 102-B
## 5 Home Appliances Refrigerator 201 HomeCool 899.99 201-A
## 6 Home Appliances Refrigerator 201 HomeCool 899.99 201-B
## 7 Home Appliances Washing Machine 202 CleanTech 499.99 202-A
## 8 Home Appliances Washing Machine 202 CleanTech 499.99 202-B
## 9 Clothing T-Shirt 301 FashionCo 19.99 301-A
## 10 Clothing T-Shirt 301 FashionCo 19.99 301-B
## 11 Clothing T-Shirt 301 FashionCo 19.99 301-C
## 12 Clothing Jeans 302 DenimWorks 49.99 302-A
## 13 Clothing Jeans 302 DenimWorks 49.99 302-B
## 14 Books Fiction Novel 401 - 14.99 401-A
## 15 Books Fiction Novel 401 - 14.99 401-B
## 16 Books Non-Fiction Guide 402 - 24.99 402-A
## 17 Books Non-Fiction Guide 402 - 24.99 402-B
## 18 Sports Equipment Basketball 501 SportsGear 29.99 501-A
## 19 Sports Equipment Tennis Racket 502 RacketPro 89.99 502-A
## 20 Sports Equipment Tennis Racket 502 RacketPro 89.99 502-B
## Series_Details
## 1 Color: Black, Storage: 64GB
## 2 Color: White, Storage: 128GB
## 3 Color: Silver, Storage: 256GB
## 4 Color: Space Gray, Storage: 512GB
## 5 Color: Stainless Steel, Capacity: 20 cu ft
## 6 Color: White, Capacity: 18 cu ft
## 7 Type: Front Load, Capacity: 4.5 cu ft
## 8 Type: Top Load, Capacity: 5.0 cu ft
## 9 Color: Blue, Size: S
## 10 Color: Red, Size: M
## 11 Color: Green, Size: L
## 12 Color: Dark Blue, Size: 32
## 13 Color: Light Blue, Size: 34
## 14 Format: Hardcover, Language: English
## 15 Format: Paperback, Language: Spanish
## 16 Format: eBook, Language: English
## 17 Format: Paperback, Language: French
## 18 Size: Size 7, Color: Orange
## 19 Material: Graphite, Color: Black
## 20 Material: Aluminum, Color: Silver
I will use a JSON file and a format converter to transform it into Parquet format. Let’s then read our generated Parquet file into a dataframe named electronics_parquet.
library(arrow)
##
## Attaching package: 'arrow'
## The following object is masked from 'package:utils':
##
## timestamp
url <- "https://raw.githubusercontent.com/Yedzinovich/Data-607/refs/heads/main/electronics.parquet"
temp_file <- tempfile(fileext = ".parquet")
download.file(url, temp_file, mode = "wb")
electronics_parquet <- read_parquet(temp_file)
print(electronics_parquet)
## # A tibble: 20 × 7
## Category Item_Name Item_ID Brand Price Series_ID Series_Details
## <chr> <chr> <int> <chr> <dbl> <chr> <chr>
## 1 Electronics Smartphone 101 Tech… 700. 101-A Color: Black,…
## 2 Electronics Smartphone 101 Tech… 700. 101-B Color: White,…
## 3 Electronics Laptop 102 Comp… 1100. 102-A Color: Silver…
## 4 Electronics Laptop 102 Comp… 1100. 102-B Color: Space …
## 5 Home Appliances Refrigerator 201 Home… 900. 201-A Color: Stainl…
## 6 Home Appliances Refrigerator 201 Home… 900. 201-B Color: White,…
## 7 Home Appliances Washing Machi… 202 Clea… 500. 202-A Type: Front L…
## 8 Home Appliances Washing Machi… 202 Clea… 500. 202-B Type: Top Loa…
## 9 Clothing T-Shirt 301 Fash… 20.0 301-A Color: Blue, …
## 10 Clothing T-Shirt 301 Fash… 20.0 301-B Color: Red, S…
## 11 Clothing T-Shirt 301 Fash… 20.0 301-C Color: Green,…
## 12 Clothing Jeans 302 Deni… 50.0 302-A Color: Dark B…
## 13 Clothing Jeans 302 Deni… 50.0 302-B Color: Light …
## 14 Books Fiction Novel 401 - 15.0 401-A Format: Hardc…
## 15 Books Fiction Novel 401 - 15.0 401-B Format: Paper…
## 16 Books Non-Fiction G… 402 - 25.0 402-A Format: eBook…
## 17 Books Non-Fiction G… 402 - 25.0 402-B Format: Paper…
## 18 Sports Equipment Basketball 501 Spor… 30.0 501-A Size: Size 7,…
## 19 Sports Equipment Tennis Racket 502 Rack… 90.0 502-A Material: Gra…
## 20 Sports Equipment Tennis Racket 502 Rack… 90.0 502-B Material: Alu…
Pros: Easy to read, lightweight, widely supported. Cons: Can be large, no schema, limited data types.
Pros: Great for web pages, rich formatting. Cons: Not for data storage, bulky, limited structure.
Pros: Self-descriptive, flexible, supports schema. Cons: Very large, complex to parse, slower.
Pros: Efficient storage, good compression, fast for big data. Cons: Not human-readable, complex tools needed, less support.
In conclusion, choosing the right data format depends on our specific needs and context. JSON is ideal for lightweight data interchange due to its simplicity and readability. HTML excels in presenting data on web pages with rich formatting options. XML offers flexibility and self-descriptive tags, making it suitable for complex hierarchical data structures, though it can be verbose and slower to process. Parquet stands out for its efficiency in storing and processing large-scale data, especially in analytical scenarios, despite requiring more sophisticated tools and being less human-readable.
Each format has its unique strengths and weaknesses, making them suitable for different applications.