Introduction

The assignment for this week involves data given from CUNYMart, acting as a header for a store that sells various items such as electronics, clothing, home appliances and more. I’ll be first showcasing the Data Preparation for the order to be the code to align correctly.

Data Preparation

raw_data <- data.frame(
  Category = c("Electronics", "Electronics", "Electronics", "Electronics",
               "Home Appliances", "Home Appliances", "Home Appliances", "Home Appliances",
               "Clothing", "Clothing", "Clothing", "Clothing", "Clothing",
               "Books", "Books", "Books", "Books",
               "Sports Equipment", "Sports Equipment", "Sports Equipment"),
  Item_Name = c("Smartphone", "Smartphone", "Laptop", "Laptop",
                "Refrigerator", "Refrigerator", "Washing Machine", "Washing Machine",
                "T-Shirt", "T-Shirt", "T-Shirt", "Jeans", "Jeans",
                "Fiction Novel", "Fiction Novel", "Non-Fiction Guide", "Non-Fiction Guide",
                "Basketball", "Tennis Racket", "Tennis Racket"),
  Item_ID = c(101, 101, 102, 102, 201, 201, 202, 202, 301, 301, 301, 302, 302,
              401, 401, 402, 402, 501, 502, 502),
  Brand = c("TechBrand", "TechBrand", "CompuBrand", "CompuBrand",
            "HomeCool", "HomeCool", "CleanTech", "CleanTech",
            "FashionCo", "FashionCo", "FashionCo", "DenimWorks", "DenimWorks",
            NA, NA, NA, NA,
            "SportsGear", "RacketPro", "RacketPro"),
  Price = c(699.99, 699.99, 1099.99, 1099.99,
            899.99, 899.99, 499.99, 499.99,
            19.99, 19.99, 19.99, 49.99, 49.99,
            14.99, 14.99, 24.99, 24.99,
            29.99, 89.99, 89.99),
  Variation_ID = c("101-A", "101-B", "102-A", "102-B",
                   "201-A", "201-B", "202-A", "202-B",
                   "301-A", "301-B", "301-C", "302-A", "302-B",
                   "401-A", "401-B", "402-A", "402-B",
                   "501-A", "502-A", "502-B"),
  Variation_Details = c("Color: Black, Storage: 64GB", "Color: White, Storage: 128GB",
                        "Color: Silver, Storage: 256GB", "Color: Space Gray, Storage: 512GB",
                        "Color: Stainless Steel, Capacity: 20 cu ft", "Color: White, Capacity: 18 cu ft",
                        "Type: Front Load, Capacity: 4.5 cu ft", "Type: Top Load, Capacity: 5.0 cu ft",
                        "Color: Blue, Size: S", "Color: Red, Size: M", "Color: Green, Size: L",
                        "Color: Dark Blue, Size: 32", "Color: Light Blue, Size: 34",
                        "Format: Hardcover, Language: English", "Format: Paperback, Language: Spanish",
                        "Format: eBook, Language: English", "Format: Paperback, Language: French",
                        "Size: Size 7, Color: Orange", "Material: Graphite, Color: Black",
                        "Material: Aluminum, Color: Silver"),
  stringsAsFactors = FALSE
)

JSON

According to chapter 23, JSON is the most common data format returned by web APIS, and is one of the 2 data formats Within web APIs that tend to be more structured and are safer and easier to work with. ## Pros - Safer to work with, because the API regulates what is being sent through between providers and consumers, as well as backups, and security access to the database. - Easier to code with because the data has already been scrubbed. ## Cons - Data must be managed within code only. - Hard to analyze for larger datasets due to the redundancy in the format’s display. - Highly repetitive, especially for larger datasets. The following is what JSON code would typically look like for the CUNYMart dataset.

[
  {
    "Category": "Electronics",
    "Item Name": "Smartphone",
    "Item ID": 101,
    "Brand": "TechBrand",
    "Price": 699.99,
    "Variation ID": "101-A",
    "Variation Details": {
      "Color": "Black",
      "Storage": "64GB"
    }
  },
  {
    "Category": "Electronics",
    "Item Name": "Smartphone",
    "Item ID": 101,
    "Brand": "TechBrand",
    "Price": 699.99,
    "Variation ID": "101-B",
    "Variation Details": {
      "Color": "White",
      "Storage": "128GB"
    }
  },
  {
    "Category": "Electronics",
    "Item Name": "Laptop",
    "Item ID": 102,
    "Brand": "CompuBrand",
    "Price": 1099.99,
    "Variation ID": "102-A",
    "Variation Details": {
      "Color": "Silver",
      "Storage": "256GB"
    }
  },
  {
    "Category": "Electronics",
    "Item Name": "Laptop",
    "Item ID": 102,
    "Brand": "CompuBrand",
    "Price": 1099.99,
    "Variation ID": "102-B",
    "Variation Details": {
      "Color": "Space Gray",
      "Storage": "512GB"
    }
  },
  {
    "Category": "Home Appliances",
    "Item Name": "Refrigerator",
    "Item ID": 201,
    "Brand": "HomeCool",
    "Price": 899.99,
    "Variation ID": "201-A",
    "Variation Details": {
      "Color": "Stainless Steel",
      "Capacity": "20 cu ft"
    }
  },
  {
    "Category": "Home Appliances",
    "Item Name": "Refrigerator",
    "Item ID": 201,
    "Brand": "HomeCool",
    "Price": 899.99,
    "Variation ID": "201-B",
    "Variation Details": {
      "Color": "White",
      "Capacity": "18 cu ft"
    }
  },
  {
    "Category": "Home Appliances",
    "Item Name": "Washing Machine",
    "Item ID": 202,
    "Brand": "CleanTech",
    "Price": 499.99,
    "Variation ID": "202-A",
    "Variation Details": {
      "Type": "Front Load",
      "Capacity": "4.5 cu ft"
    }
  },
  {
    "Category": "Home Appliances",
    "Item Name": "Washing Machine",
    "Item ID": 202,
    "Brand": "CleanTech",
    "Price": 499.99,
    "Variation ID": "202-B",
    "Variation Details": {
      "Type": "Top Load",
      "Capacity": "5.0 cu ft"
    }
  },
  {
    "Category": "Clothing",
    "Item Name": "T-Shirt",
    "Item ID": 301,
    "Brand": "FashionCo",
    "Price": 19.99,
    "Variation ID": "301-A",
    "Variation Details": {
      "Color": "Blue",
      "Size": "S"
    }
  },
  {
    "Category": "Clothing",
    "Item Name": "T-Shirt",
    "Item ID": 301,
    "Brand": "FashionCo",
    "Price": 19.99,
    "Variation ID": "301-B",
    "Variation Details": {
      "Color": "Red",
      "Size": "M"
    }
  },
  #etc. for the corresponding categories
]

R Code for JSON

json_data <- raw_data %>% 
  group_by(Category) %>% 
  summarise(items = list(cur_data()))

## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `items = list(cur_data())`.
## ℹ In group 1: `Category = "Books"`.
## Caused by warning:
## ! `cur_data()` was deprecated in dplyr 1.1.0.
## ℹ Please use `pick()` instead.

write_json(json_data, "inventory.json")
r_json <- read_json("inventory.json")

HTML

Chapter 24 dubs HTML as a form of web scraping, this is the less structured data format aimed for the less structured datasets. ## Pros - Works flexibly whether the site has API or not. - Simplistic view/easy readability. - Its complicity allows for further styling and interactivity (depending on programming language) ## Cons - Risky working with because any changes to structure of web page may break accessing routine. - Harder to code for due to its complexity threshold which at some point, would requirement of extensive knowledge and various different tools when compared to the Web APIs. - Likely that the data hasn’t been scrubbed, meaning the data hasn’t been properly structured at all beforehand. - Rough data interchangeability between providers and consumers.

<table>
  <tr>
    <th>Category</th>
    <th>Item Name</th>
    <th>Item ID</th>
    <th>Price</th>
  </tr>
  <tr>
    <td>Electronics</td>
    <td>Smartphone</td>
    <td>101</td>
    <td>699.99</td>
  </tr>
</table>
  #etc. for the corresponding categories

R Code for HTML

html_data <- raw_data %>%
  mutate(
    Variation_Details = strsplit(Variation_Details, ", "),
    details = lapply(Variation_Details, function(x) {
      keys <- trimws(gsub(":.*", "", x))
      values <- trimws(gsub(".*: ", "", x))
      setNames(as.list(values), keys)
    })
  )

html_content <- html_data %>%
  mutate(
    details_str = sapply(details, function(x) paste(names(x), x, sep = ": ", collapse = "; "))
  ) %>%
  select(Category, Item_Name, Item_ID, Price, details_str) %>%
  {paste0(
    "<table>\n<tr>",
    paste0("<th>", names(.), "</th>", collapse = ""),
    "</tr>\n",
    paste0(
      "<tr>",
      apply(., 1, function(row) paste0("<td>", row, "</td>", collapse = "")),
      "</tr>\n",
      collapse = ""
    ),
    "</table>"
  )}

writeLines(html_content, "inventory.html")
r_html <- read_html("inventory.html") %>%
  html_table() %>%
  .[[1]]

XML

One of the 2 data formats Within web APIs that tend to be more structured and are safer and easier to work with. ## Pros - Safer to work with, because the API regulates its structure. - Supports validation via schemas (XSD), which helps maintain data integrity. - Easier to code with because the data has already been scrubbed. ## Cons - More complex data transformation requirements compared to the JSON - Not as effective for modern web APIs because of how JSON is closely integrated with Javascript.

<Inventory>
  <Item>
    <Category>Electronics</Category>
    <ItemName>Smartphone</ItemName>
    <ItemID>101</ItemID>
    <Brand>TechBrand</Brand>
    <Price>699.99</Price>
    <Variation>
      <VariationID>101-A</VariationID>
      <Details>
        <Color>Black</Color>
        <Storage>64GB</Storage>
      </Details>
    </Variation>
  </Item>
  <Item>
    <Category>Electronics</Category>
    <ItemName>Smartphone</ItemName>
    <ItemID>101</ItemID>
    <Brand>TechBrand</Brand>
    <Price>699.99</Price>
    <Variation>
      <VariationID>101-B</VariationID>
      <Details>
        <Color>White</Color>
        <Storage>128GB</Storage>
      </Details>
    </Variation>
  </Item>
  #etc. for the corresponding categories

R code for XML

xml <- xml_new_root("inventory")
for (cat in unique(raw_data$Category)) {
  category_node <- xml_add_child(xml, "category", name = cat)
  items <- raw_data %>% filter(Category == cat)
  for (i in seq_len(nrow(items))) {
    item_node <- xml_add_child(category_node, "item", id = items$Item_ID[i])
    xml_add_child(item_node, "name", items$Item_Name[i])
    xml_add_child(item_node, "price", items$Price[i])
  }
}
write_xml(xml, "inventory.xml")
r_xml <- read_xml("inventory.xml") %>% 
  xml_find_all("//item") %>% 
  purrr::map_df(~list(
    category = xml_find_first(.x, "../@name") %>% xml_text(),
    id = xml_attr(.x, "id"),
    name = xml_text(xml_find_first(.x, "./name")),
    price = xml_text(xml_find_first(.x, "./price")) %>% as.numeric()
  ))

Parquet

The most unique data format of the 4 and a powerful alternative: the parquet format, an open standards-based format widely used by big data systems. ## Pros - Great for big datasets for conciseness - Compression for read-heavy analytical workloads. - Column storage allows for better funneling for any subsets. ## Cons - Initial binary format won’t allow visible reading externally. - Cannot be easily changed based on visibility of the data and the requirement of specialized tools such as the Arrow package.

R code for Parquet

For the sake of readability, I am using a temporary file for this example of R code for Parquet due to my error “([Windows error 1224] The requested operation cannot be performed on a file with a user-mapped section open)”, preventing the reading of a fixed path for Parquet:

library(arrow) #uses specialized arrow package
tmp_file <- tempfile(fileext = ".parquet")
write_parquet(raw_data, tmp_file)
r_parquet <- read_parquet(tmp_file)
print(r_parquet)

## # A tibble: 20 × 7
##    Category        Item_Name Item_ID Brand  Price Variation_ID Variation_Details
##    <chr>           <chr>       <dbl> <chr>  <dbl> <chr>        <chr>            
##  1 Electronics     Smartpho…     101 Tech…  700.  101-A        Color: Black, St…
##  2 Electronics     Smartpho…     101 Tech…  700.  101-B        Color: White, St…
##  3 Electronics     Laptop        102 Comp… 1100.  102-A        Color: Silver, S…
##  4 Electronics     Laptop        102 Comp… 1100.  102-B        Color: Space Gra…
##  5 Home Appliances Refriger…     201 Home…  900.  201-A        Color: Stainless…
##  6 Home Appliances Refriger…     201 Home…  900.  201-B        Color: White, Ca…
##  7 Home Appliances Washing …     202 Clea…  500.  202-A        Type: Front Load…
##  8 Home Appliances Washing …     202 Clea…  500.  202-B        Type: Top Load, …
##  9 Clothing        T-Shirt       301 Fash…   20.0 301-A        Color: Blue, Siz…
## 10 Clothing        T-Shirt       301 Fash…   20.0 301-B        Color: Red, Size…
## 11 Clothing        T-Shirt       301 Fash…   20.0 301-C        Color: Green, Si…
## 12 Clothing        Jeans         302 Deni…   50.0 302-A        Color: Dark Blue…
## 13 Clothing        Jeans         302 Deni…   50.0 302-B        Color: Light Blu…
## 14 Books           Fiction …     401 <NA>    15.0 401-A        Format: Hardcove…
## 15 Books           Fiction …     401 <NA>    15.0 401-B        Format: Paperbac…
## 16 Books           Non-Fict…     402 <NA>    25.0 402-A        Format: eBook, L…
## 17 Books           Non-Fict…     402 <NA>    25.0 402-B        Format: Paperbac…
## 18 Sports Equipme… Basketba…     501 Spor…   30.0 501-A        Size: Size 7, Co…
## 19 Sports Equipme… Tennis R…     502 Rack…   90.0 502-A        Material: Graphi…
## 20 Sports Equipme… Tennis R…     502 Rack…   90.0 502-B        Material: Alumin…

Conclusion

The following data formats—JSON, HTML, XML, and Parquet—to are used prepare and analyze the sample inventory data from CUNYMart, located at 123 Example Street, Anytown, USA. Each data format had their own pros and cons: JSON and XML acted as web APIs tools and are simplistic for data interchange and validation due to their already structured and scrubbed data. HTML acted as a rough way to explore data as a web scraping method, and is ideal for a last-resort for data exploring data inaccessible for APIs or for data needed for visualization, and Parquet offers a binary method for analytical workloads. Additionally, we discussed the importance of web scraping and web APIs as their stages in data exploring and how they translate in their data formats in terms of these pros and cons. Using this assignment as a self-described “warm-up” will allow me and other students to reliably focus past these surface level information in the coming 2 weeks.

DATA607 Week 7

Gullit Navarrete

2025-03-16