Introduction

Assignment: working with JSON, HTML, XML, and Parquet in R

We have received the following data from CUNYMart, located at 123 Example Street, Anytown, USA.

Ask: “This data will be used for inventory analysis at the retailer. You are required to prepare the data for analysis by formatting it in JSON, HTML, XML, and Parquet. Additionally, provide the pros and cons of each format.”

Let’s create a dataframe from the provided data:

library(tibble)

electronics_df <- tibble::tribble(
  ~Category, ~Item_Name, ~Item_ID, ~Brand, ~Price, ~Series_ID, ~Series_Details,
  "Electronics", "Smartphone", 101, "TechBrand", 699.99, "101-A", "Color: Black, Storage: 64GB",
  "Electronics", "Smartphone", 101, "TechBrand", 699.99, "101-B", "Color: White, Storage: 128GB",
  "Electronics", "Laptop", 102, "CompuBrand", 1099.99, "102-A", "Color: Silver, Storage: 256GB",
  "Electronics", "Laptop", 102, "CompuBrand", 1099.99, "102-B", "Color: Space Gray, Storage: 512GB",
  "Home Appliances", "Refrigerator", 201, "HomeCool", 899.99, "201-A", "Color: Stainless Steel, Capacity: 20 cu ft",
  "Home Appliances", "Refrigerator", 201, "HomeCool", 899.99, "201-B", "Color: White, Capacity: 18 cu ft",
  "Home Appliances", "Washing Machine", 202, "CleanTech", 499.99, "202-A", "Type: Front Load, Capacity: 4.5 cu ft",
  "Home Appliances", "Washing Machine", 202, "CleanTech", 499.99, "202-B", "Type: Top Load, Capacity: 5.0 cu ft",
  "Clothing", "T-Shirt", 301, "FashionCo", 19.99, "301-A", "Color: Blue, Size: S",
  "Clothing", "T-Shirt", 301, "FashionCo", 19.99, "301-B", "Color: Red, Size: M",
  "Clothing", "T-Shirt", 301, "FashionCo", 19.99, "301-C", "Color: Green, Size: L",
  "Clothing", "Jeans", 302, "DenimWorks", 49.99, "302-A", "Color: Dark Blue, Size: 32",
  "Clothing", "Jeans", 302, "DenimWorks", 49.99, "302-B", "Color: Light Blue, Size: 34",
  "Books", "Fiction Novel", 401, "-", 14.99, "401-A", "Format: Hardcover, Language: English",
  "Books", "Fiction Novel", 401, "-", 14.99, "401-B", "Format: Paperback, Language: Spanish",
  "Books", "Non-Fiction Guide", 402, "-", 24.99, "402-A", "Format: eBook, Language: English",
  "Books", "Non-Fiction Guide", 402, "-", 24.99, "402-B", "Format: Paperback, Language: French",
  "Sports Equipment", "Basketball", 501, "SportsGear", 29.99, "501-A", "Size: Size 7, Color: Orange",
  "Sports Equipment", "Tennis Racket", 502, "RacketPro", 89.99, "502-A", "Material: Graphite, Color: Black",
  "Sports Equipment", "Tennis Racket", 502, "RacketPro", 89.99, "502-B", "Material: Aluminum, Color: Silver"
)

print(electronics_df)
## # A tibble: 20 × 7
##    Category         Item_Name      Item_ID Brand  Price Series_ID Series_Details
##    <chr>            <chr>            <dbl> <chr>  <dbl> <chr>     <chr>         
##  1 Electronics      Smartphone         101 Tech…  700.  101-A     Color: Black,…
##  2 Electronics      Smartphone         101 Tech…  700.  101-B     Color: White,…
##  3 Electronics      Laptop             102 Comp… 1100.  102-A     Color: Silver…
##  4 Electronics      Laptop             102 Comp… 1100.  102-B     Color: Space …
##  5 Home Appliances  Refrigerator       201 Home…  900.  201-A     Color: Stainl…
##  6 Home Appliances  Refrigerator       201 Home…  900.  201-B     Color: White,…
##  7 Home Appliances  Washing Machi…     202 Clea…  500.  202-A     Type: Front L…
##  8 Home Appliances  Washing Machi…     202 Clea…  500.  202-B     Type: Top Loa…
##  9 Clothing         T-Shirt            301 Fash…   20.0 301-A     Color: Blue, …
## 10 Clothing         T-Shirt            301 Fash…   20.0 301-B     Color: Red, S…
## 11 Clothing         T-Shirt            301 Fash…   20.0 301-C     Color: Green,…
## 12 Clothing         Jeans              302 Deni…   50.0 302-A     Color: Dark B…
## 13 Clothing         Jeans              302 Deni…   50.0 302-B     Color: Light …
## 14 Books            Fiction Novel      401 -       15.0 401-A     Format: Hardc…
## 15 Books            Fiction Novel      401 -       15.0 401-B     Format: Paper…
## 16 Books            Non-Fiction G…     402 -       25.0 402-A     Format: eBook…
## 17 Books            Non-Fiction G…     402 -       25.0 402-B     Format: Paper…
## 18 Sports Equipment Basketball         501 Spor…   30.0 501-A     Size: Size 7,…
## 19 Sports Equipment Tennis Racket      502 Rack…   90.0 502-A     Material: Gra…
## 20 Sports Equipment Tennis Racket      502 Rack…   90.0 502-B     Material: Alu…

Generate a JSON file from this dataset.

library(jsonlite)
json_data <- toJSON(electronics_df, pretty = TRUE)
write(json_data, file = "electronics.json")

Save the JSON file in github. GitHub Link: https://raw.githubusercontent.com/Yedzinovich/Data-607/refs/heads/main/electronics.json

####1 - JSON Let’s load our generated JSON file into a dataframe named electronics_json.

library(jsonlite)
electronics_json <- fromJSON("https://raw.githubusercontent.com/Yedzinovich/Data-607/refs/heads/main/electronics.json")
print(electronics_json)
##            Category         Item_Name Item_ID      Brand   Price Series_ID
## 1       Electronics        Smartphone     101  TechBrand  699.99     101-A
## 2       Electronics        Smartphone     101  TechBrand  699.99     101-B
## 3       Electronics            Laptop     102 CompuBrand 1099.99     102-A
## 4       Electronics            Laptop     102 CompuBrand 1099.99     102-B
## 5   Home Appliances      Refrigerator     201   HomeCool  899.99     201-A
## 6   Home Appliances      Refrigerator     201   HomeCool  899.99     201-B
## 7   Home Appliances   Washing Machine     202  CleanTech  499.99     202-A
## 8   Home Appliances   Washing Machine     202  CleanTech  499.99     202-B
## 9          Clothing           T-Shirt     301  FashionCo   19.99     301-A
## 10         Clothing           T-Shirt     301  FashionCo   19.99     301-B
## 11         Clothing           T-Shirt     301  FashionCo   19.99     301-C
## 12         Clothing             Jeans     302 DenimWorks   49.99     302-A
## 13         Clothing             Jeans     302 DenimWorks   49.99     302-B
## 14            Books     Fiction Novel     401          -   14.99     401-A
## 15            Books     Fiction Novel     401          -   14.99     401-B
## 16            Books Non-Fiction Guide     402          -   24.99     402-A
## 17            Books Non-Fiction Guide     402          -   24.99     402-B
## 18 Sports Equipment        Basketball     501 SportsGear   29.99     501-A
## 19 Sports Equipment     Tennis Racket     502  RacketPro   89.99     502-A
## 20 Sports Equipment     Tennis Racket     502  RacketPro   89.99     502-B
##                                Series_Details
## 1                 Color: Black, Storage: 64GB
## 2                Color: White, Storage: 128GB
## 3               Color: Silver, Storage: 256GB
## 4           Color: Space Gray, Storage: 512GB
## 5  Color: Stainless Steel, Capacity: 20 cu ft
## 6            Color: White, Capacity: 18 cu ft
## 7       Type: Front Load, Capacity: 4.5 cu ft
## 8         Type: Top Load, Capacity: 5.0 cu ft
## 9                        Color: Blue, Size: S
## 10                        Color: Red, Size: M
## 11                      Color: Green, Size: L
## 12                 Color: Dark Blue, Size: 32
## 13                Color: Light Blue, Size: 34
## 14       Format: Hardcover, Language: English
## 15       Format: Paperback, Language: Spanish
## 16           Format: eBook, Language: English
## 17        Format: Paperback, Language: French
## 18                Size: Size 7, Color: Orange
## 19           Material: Graphite, Color: Black
## 20          Material: Aluminum, Color: Silver

####2 - HTML I will use a JSON file and a format converter to transform it into HTML format. Let’s then read our generated HTML file into a dataframe named electronics_html.

library(rvest)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
url <- "https://raw.githubusercontent.com/Yedzinovich/Data-607/refs/heads/main/electronics_html.html"
html_content <- read_html(url)

electronics_html <- html_content %>%
  html_node("table") %>%
  html_table(fill = TRUE)
print(electronics_html)
## # A tibble: 21 × 7
##    Category        Item_Name       Item_ID Brand  Price Series_ID Series_Details
##    <chr>           <chr>           <chr>   <chr>  <chr> <chr>     <chr>         
##  1 Electronics     Smartphone      101     TechB… 699.… 101-A     Color: Black,…
##  2 Electronics     Smartphone      101     TechB… 699.… 101-B     Color: White,…
##  3 Electronics     Laptop          102     Compu… 1099… 102-A     Color: Silver…
##  4 Electronics     Laptop          102     Compu… 1099… 102-B     Color: Space …
##  5 Home Appliances Refrigerator    201     HomeC… 899.… 201-A     Color: Stainl…
##  6 Home Appliances Refrigerator    201     HomeC… 899.… 201-B     Color: White,…
##  7 Home Appliances Washing Machine 202     Clean… 499.… 202-A     Type: Front L…
##  8 Home Appliances Washing Machine 202     Clean… 499.… 202-B     Type: Top Loa…
##  9 Clothing        T-Shirt         301     Fashi… 19.99 301-A     Color: Blue, …
## 10 Clothing        T-Shirt         301     Fashi… 19.99 301-B     Color: Red, S…
## # ℹ 11 more rows

3- XML

I will use a JSON file and a format converter to transform it into XML format. Let’s then read our generated XML file into a dataframe named electronics_xml.

library(xml2)
url <- "https://raw.githubusercontent.com/Yedzinovich/Data-607/refs/heads/main/electronics_xml.xml"
xml_content <- read_xml(url)
items <- xml_find_all(xml_content, "//item")
print(items)
## {xml_nodeset (20)}
##  [1] <item>\n  <Category>Electronics</Category>\n  <Item_Name>Smartphone</Ite ...
##  [2] <item>\n  <Category>Electronics</Category>\n  <Item_Name>Smartphone</Ite ...
##  [3] <item>\n  <Category>Electronics</Category>\n  <Item_Name>Laptop</Item_Na ...
##  [4] <item>\n  <Category>Electronics</Category>\n  <Item_Name>Laptop</Item_Na ...
##  [5] <item>\n  <Category>Home Appliances</Category>\n  <Item_Name>Refrigerato ...
##  [6] <item>\n  <Category>Home Appliances</Category>\n  <Item_Name>Refrigerato ...
##  [7] <item>\n  <Category>Home Appliances</Category>\n  <Item_Name>Washing Mac ...
##  [8] <item>\n  <Category>Home Appliances</Category>\n  <Item_Name>Washing Mac ...
##  [9] <item>\n  <Category>Clothing</Category>\n  <Item_Name>T-Shirt</Item_Name ...
## [10] <item>\n  <Category>Clothing</Category>\n  <Item_Name>T-Shirt</Item_Name ...
## [11] <item>\n  <Category>Clothing</Category>\n  <Item_Name>T-Shirt</Item_Name ...
## [12] <item>\n  <Category>Clothing</Category>\n  <Item_Name>Jeans</Item_Name>\ ...
## [13] <item>\n  <Category>Clothing</Category>\n  <Item_Name>Jeans</Item_Name>\ ...
## [14] <item>\n  <Category>Books</Category>\n  <Item_Name>Fiction Novel</Item_N ...
## [15] <item>\n  <Category>Books</Category>\n  <Item_Name>Fiction Novel</Item_N ...
## [16] <item>\n  <Category>Books</Category>\n  <Item_Name>Non-Fiction Guide</It ...
## [17] <item>\n  <Category>Books</Category>\n  <Item_Name>Non-Fiction Guide</It ...
## [18] <item>\n  <Category>Sports Equipment</Category>\n  <Item_Name>Basketball ...
## [19] <item>\n  <Category>Sports Equipment</Category>\n  <Item_Name>Tennis Rac ...
## [20] <item>\n  <Category>Sports Equipment</Category>\n  <Item_Name>Tennis Rac ...
categories <- xml_text(xml_find_all(items, "Category"))
item_names <- xml_text(xml_find_all(items, "Item_Name"))
item_ids <- xml_text(xml_find_all(items, "Item_ID"))
brands <- xml_text(xml_find_all(items, "Brand"))
prices <- xml_text(xml_find_all(items, "Price"))
series_ids <- xml_text(xml_find_all(items, "Series_ID"))
series_details <- xml_text(xml_find_all(items, "Series_Details"))

electronics_xml <- data.frame(
  Category = categories,
  Item_Name = item_names,
  Item_ID = item_ids,
  Brand = brands,
  Price = prices,
  Series_ID = series_ids,
  Series_Details = series_details,
  stringsAsFactors = FALSE
)

print(electronics_xml)
##            Category         Item_Name Item_ID      Brand   Price Series_ID
## 1       Electronics        Smartphone     101  TechBrand  699.99     101-A
## 2       Electronics        Smartphone     101  TechBrand  699.99     101-B
## 3       Electronics            Laptop     102 CompuBrand 1099.99     102-A
## 4       Electronics            Laptop     102 CompuBrand 1099.99     102-B
## 5   Home Appliances      Refrigerator     201   HomeCool  899.99     201-A
## 6   Home Appliances      Refrigerator     201   HomeCool  899.99     201-B
## 7   Home Appliances   Washing Machine     202  CleanTech  499.99     202-A
## 8   Home Appliances   Washing Machine     202  CleanTech  499.99     202-B
## 9          Clothing           T-Shirt     301  FashionCo   19.99     301-A
## 10         Clothing           T-Shirt     301  FashionCo   19.99     301-B
## 11         Clothing           T-Shirt     301  FashionCo   19.99     301-C
## 12         Clothing             Jeans     302 DenimWorks   49.99     302-A
## 13         Clothing             Jeans     302 DenimWorks   49.99     302-B
## 14            Books     Fiction Novel     401          -   14.99     401-A
## 15            Books     Fiction Novel     401          -   14.99     401-B
## 16            Books Non-Fiction Guide     402          -   24.99     402-A
## 17            Books Non-Fiction Guide     402          -   24.99     402-B
## 18 Sports Equipment        Basketball     501 SportsGear   29.99     501-A
## 19 Sports Equipment     Tennis Racket     502  RacketPro   89.99     502-A
## 20 Sports Equipment     Tennis Racket     502  RacketPro   89.99     502-B
##                                Series_Details
## 1                 Color: Black, Storage: 64GB
## 2                Color: White, Storage: 128GB
## 3               Color: Silver, Storage: 256GB
## 4           Color: Space Gray, Storage: 512GB
## 5  Color: Stainless Steel, Capacity: 20 cu ft
## 6            Color: White, Capacity: 18 cu ft
## 7       Type: Front Load, Capacity: 4.5 cu ft
## 8         Type: Top Load, Capacity: 5.0 cu ft
## 9                        Color: Blue, Size: S
## 10                        Color: Red, Size: M
## 11                      Color: Green, Size: L
## 12                 Color: Dark Blue, Size: 32
## 13                Color: Light Blue, Size: 34
## 14       Format: Hardcover, Language: English
## 15       Format: Paperback, Language: Spanish
## 16           Format: eBook, Language: English
## 17        Format: Paperback, Language: French
## 18                Size: Size 7, Color: Orange
## 19           Material: Graphite, Color: Black
## 20          Material: Aluminum, Color: Silver

4 - Parquet

I will use a JSON file and a format converter to transform it into Parquet format. Let’s then read our generated Parquet file into a dataframe named electronics_parquet.

library(arrow)
## 
## Attaching package: 'arrow'
## The following object is masked from 'package:utils':
## 
##     timestamp
url <- "https://raw.githubusercontent.com/Yedzinovich/Data-607/refs/heads/main/electronics.parquet"

temp_file <- tempfile(fileext = ".parquet")
download.file(url, temp_file, mode = "wb")

electronics_parquet <- read_parquet(temp_file)
print(electronics_parquet)
## # A tibble: 20 × 7
##    Category         Item_Name      Item_ID Brand  Price Series_ID Series_Details
##    <chr>            <chr>            <int> <chr>  <dbl> <chr>     <chr>         
##  1 Electronics      Smartphone         101 Tech…  700.  101-A     Color: Black,…
##  2 Electronics      Smartphone         101 Tech…  700.  101-B     Color: White,…
##  3 Electronics      Laptop             102 Comp… 1100.  102-A     Color: Silver…
##  4 Electronics      Laptop             102 Comp… 1100.  102-B     Color: Space …
##  5 Home Appliances  Refrigerator       201 Home…  900.  201-A     Color: Stainl…
##  6 Home Appliances  Refrigerator       201 Home…  900.  201-B     Color: White,…
##  7 Home Appliances  Washing Machi…     202 Clea…  500.  202-A     Type: Front L…
##  8 Home Appliances  Washing Machi…     202 Clea…  500.  202-B     Type: Top Loa…
##  9 Clothing         T-Shirt            301 Fash…   20.0 301-A     Color: Blue, …
## 10 Clothing         T-Shirt            301 Fash…   20.0 301-B     Color: Red, S…
## 11 Clothing         T-Shirt            301 Fash…   20.0 301-C     Color: Green,…
## 12 Clothing         Jeans              302 Deni…   50.0 302-A     Color: Dark B…
## 13 Clothing         Jeans              302 Deni…   50.0 302-B     Color: Light …
## 14 Books            Fiction Novel      401 -       15.0 401-A     Format: Hardc…
## 15 Books            Fiction Novel      401 -       15.0 401-B     Format: Paper…
## 16 Books            Non-Fiction G…     402 -       25.0 402-A     Format: eBook…
## 17 Books            Non-Fiction G…     402 -       25.0 402-B     Format: Paper…
## 18 Sports Equipment Basketball         501 Spor…   30.0 501-A     Size: Size 7,…
## 19 Sports Equipment Tennis Racket      502 Rack…   90.0 502-A     Material: Gra…
## 20 Sports Equipment Tennis Racket      502 Rack…   90.0 502-B     Material: Alu…

Conclusion

JSON (JavaScript Object Notation)

Pros: Easy to read, lightweight, widely supported. Cons: Can be large, no schema, limited data types.

HTML (HyperText Markup Language)

Pros: Great for web pages, rich formatting. Cons: Not for data storage, bulky, limited structure.

XML (eXtensible Markup Language)

Pros: Self-descriptive, flexible, supports schema. Cons: Very large, complex to parse, slower.

Parquet

Pros: Efficient storage, good compression, fast for big data. Cons: Not human-readable, complex tools needed, less support.

In conclusion, choosing the right data format depends on our specific needs and context. JSON is ideal for lightweight data interchange due to its simplicity and readability. HTML excels in presenting data on web pages with rich formatting options. XML offers flexibility and self-descriptive tags, making it suitable for complex hierarchical data structures, though it can be verbose and slower to process. Parquet stands out for its efficiency in storing and processing large-scale data, especially in analytical scenarios, despite requiring more sophisticated tools and being less human-readable.

Each format has its unique strengths and weaknesses, making them suitable for different applications.