Loading libraries

library(tidyverse)
library(pdftools)
#library(tabulizer)
library(devtools)
library(dplyr)
library(knitr)
library(htmltools)
library(kableExtra)
library(arrow)
library(rvest)

Converting PDF to dataframe

This code extracts the text from the PDF file, removes the lines that are not data, and builds the lab7_inventory_data dataframe.

lab7_pdf <- pdf_text("C:/CUNY_MSDS/DATA607/File_Formats_Assignments.pdf")
cat(lab7_pdf)
## Data 607
## Assignment: working with JSON, HTML, XML, and Parquet in R
## You have received the following data from CUNYMart, located at 123 Example Street,
## Anytown, USA.
## 
## Category,Item Name,Item ID,Brand,Price,Variation ID,Variation Details
## Electronics,Smartphone,101,TechBrand,699.99,101-A,Color: Black, Storage: 64GB
## Electronics,Smartphone,101,TechBrand,699.99,101-B,Color: White, Storage: 128GB
## Electronics,Laptop,102,CompuBrand,1099.99,102-A,Color: Silver, Storage: 256GB
## Electronics,Laptop,102,CompuBrand,1099.99,102-B,Color: Space Gray, Storage: 512GB
## Home Appliances,Refrigerator,201,HomeCool,899.99,201-A,Color: Stainless Steel, Capacity:
## 20 cu ft
## Home Appliances,Refrigerator,201,HomeCool,899.99,201-B,Color: White, Capacity: 18 cu ft
## Home Appliances,Washing Machine,202,CleanTech,499.99,202-A,Type: Front Load, Capacity:
## 4.5 cu ft
## Home Appliances,Washing Machine,202,CleanTech,499.99,202-B,Type: Top Load, Capacity:
## 5.0 cu ft
## Clothing,T-Shirt,301,FashionCo,19.99,301-A,Color: Blue, Size: S
## Clothing,T-Shirt,301,FashionCo,19.99,301-B,Color: Red, Size: M
## Clothing,T-Shirt,301,FashionCo,19.99,301-C,Color: Green, Size: L
## Clothing,Jeans,302,DenimWorks,49.99,302-A,Color: Dark Blue, Size: 32
## Clothing,Jeans,302,DenimWorks,49.99,302-B,Color: Light Blue, Size: 34
## Books,Fiction Novel,401,-,14.99,401-A,Format: Hardcover, Language: English
## Books,Fiction Novel,401,-,14.99,401-B,Format: Paperback, Language: Spanish
## Books,Non-Fiction Guide,402,-,24.99,402-A,Format: eBook, Language: English
## Books,Non-Fiction Guide,402,-,24.99,402-B,Format: Paperback, Language: French
## Sports Equipment,Basketball,501,SportsGear,29.99,501-A,Size: Size 7, Color: Orange
## Sports Equipment,Tennis Racket,502,RacketPro,89.99,502-A,Material: Graphite, Color: Black
## Sports Equipment,Tennis Racket,502,RacketPro,89.99,502-B,Material: Aluminum, Color: Silver
## 
## 
## This data will be used for inventory analysis at the retailer. You are required to prepare the data
## for analysis by formatting it in JSON, HTML, XML, and Parquet. Additionally, provide the pros
## and cons of each format.
## 
## Your must include R code for generating and importing the data into R.
txt_lines <- unlist(strsplit(lab7_pdf, "\n"))

head(txt_lines)
## [1] "Data 607"                                                                          
## [2] "Assignment: working with JSON, HTML, XML, and Parquet in R"                        
## [3] "You have received the following data from CUNYMart, located at 123 Example Street,"
## [4] "Anytown, USA."                                                                     
## [5] ""                                                                                  
## [6] "Category,Item Name,Item ID,Brand,Price,Variation ID,Variation Details"
#tail(txt_lines)

txt_lines_clean <- txt_lines[-(1:5)]
txt_lines_clean <- txt_lines_clean[-(25:31)]


#print(txt_lines_clean)

lab7_inventory_data <- do.call(rbind, strsplit(txt_lines_clean, ","))
## Warning in (function (..., deparse.level = 1) : number of columns of result is
## not a multiple of vector length (arg 1)
lab7_inventory_data <- as.data.frame(lab7_inventory_data)
colnames(lab7_inventory_data) <- as.character(lab7_inventory_data[1, ])
lab7_inventory_data <- lab7_inventory_data[-1, ]
colnames(lab7_inventory_data)[ncol(lab7_inventory_data)] <- "Description"
head(lab7_inventory_data)
##          Category    Item Name  Item ID      Brand    Price Variation ID
## 2     Electronics   Smartphone      101  TechBrand   699.99        101-A
## 3     Electronics   Smartphone      101  TechBrand   699.99        101-B
## 4     Electronics       Laptop      102 CompuBrand  1099.99        102-A
## 5     Electronics       Laptop      102 CompuBrand  1099.99        102-B
## 6 Home Appliances Refrigerator      201   HomeCool   899.99        201-A
## 7        20 cu ft     20 cu ft 20 cu ft   20 cu ft 20 cu ft     20 cu ft
##        Variation Details     Description
## 2           Color: Black   Storage: 64GB
## 3           Color: White  Storage: 128GB
## 4          Color: Silver  Storage: 256GB
## 5      Color: Space Gray  Storage: 512GB
## 6 Color: Stainless Steel       Capacity:
## 7               20 cu ft        20 cu ft
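The warning and the stray `20 cu ft` row above come from PDF lines that wrapped in the middle of a record. As a minimal sketch (not the approach used in this lab), the wrapped pieces could be glued back onto their record before splitting. The sample vector `txt_lines_demo` is hypothetical, and the rule "a line without a comma is a continuation" is an assumption that happens to fit this data.

```r
# Hypothetical sample that mimics the wrapped PDF lines shown above
txt_lines_demo <- c(
  "Category,Item Name,Item ID,Brand,Price,Variation ID,Variation Details",
  "Home Appliances,Refrigerator,201,HomeCool,899.99,201-A,Color: Stainless Steel, Capacity:",
  "20 cu ft",
  "Home Appliances,Refrigerator,201,HomeCool,899.99,201-B,Color: White, Capacity: 18 cu ft"
)

# Assumption: a line with no comma is the tail of the previous record
is_continuation <- !grepl(",", txt_lines_demo, fixed = TRUE)

# Number the records, then paste each record's pieces into one line
record_id <- cumsum(!is_continuation)
joined <- unname(vapply(split(txt_lines_demo, record_id),
                        paste, character(1), collapse = " "))

# Split the data rows on commas and trim stray spaces around each field
fields <- lapply(strsplit(joined[-1], ",", fixed = TRUE), trimws)
inventory_demo <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)

# The header has 7 names; the 8th field is the second half of the details
names(inventory_demo) <- c(strsplit(joined[1], ",", fixed = TRUE)[[1]], "Description")
```

With every record on one line, each row splits into the same number of fields, so `rbind` has nothing to recycle and no fragment rows need to be removed by index afterwards.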

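Here I clean the data: the three Capacity values that wrapped onto their own line in the PDF are written back into Description, and the leftover fragment rows are dropped.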

lab7_inventory_data$Description[5] <- "Capacity: 20 cu ft"
lab7_inventory_data$Description[8] <- "Capacity: 4.5 cu ft"
lab7_inventory_data$Description[10] <- "Capacity: 5.0 cu ft"
lab7_inventory_data <- lab7_inventory_data[-c(6, 9, 11), ]
head(lab7_inventory_data)
##          Category    Item Name Item ID      Brand   Price Variation ID
## 2     Electronics   Smartphone     101  TechBrand  699.99        101-A
## 3     Electronics   Smartphone     101  TechBrand  699.99        101-B
## 4     Electronics       Laptop     102 CompuBrand 1099.99        102-A
## 5     Electronics       Laptop     102 CompuBrand 1099.99        102-B
## 6 Home Appliances Refrigerator     201   HomeCool  899.99        201-A
## 8 Home Appliances Refrigerator     201   HomeCool  899.99        201-B
##        Variation Details         Description
## 2           Color: Black       Storage: 64GB
## 3           Color: White      Storage: 128GB
## 4          Color: Silver      Storage: 256GB
## 5      Color: Space Gray      Storage: 512GB
## 6 Color: Stainless Steel  Capacity: 20 cu ft
## 8           Color: White  Capacity: 18 cu ft
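Because the dataframe was built from split text, every column is character, some Description values keep a leading space, and the row names have gaps. A possible follow-up cleaning step is sketched below on a small hypothetical dataframe (`inventory_demo` is not part of the lab code); whether Item ID should be an integer or stay text is a judgment call.

```r
# Hypothetical two-row stand-in for lab7_inventory_data
inventory_demo <- data.frame(
  `Item ID` = c("101", "201"),
  Price = c("699.99", "899.99"),
  Description = c(" Storage: 64GB", "Capacity: 20 cu ft"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
rownames(inventory_demo) <- c("2", "6")  # gaps like those left by row removal

# Remove the leading space left over from splitting on ","
inventory_demo$Description <- trimws(inventory_demo$Description)

# Store numbers as numbers so they can be summed or compared later
inventory_demo$Price <- as.numeric(inventory_demo$Price)
inventory_demo$`Item ID` <- as.integer(inventory_demo$`Item ID`)

# Reset row names to 1..n
rownames(inventory_demo) <- NULL
```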

Creating XML data file

This code creates lab7_inventory.xml from lab7_inventory_data. After that, it converts the XML file back into a dataframe.

library(XML)
lab7_to_xml <- function(df, root_name = "Inventory", item_name = "Item") {
  if (!is.data.frame(df)) stop("Error: Input data must be a dataframe")

  doc <- newXMLDoc()
  root <- newXMLNode(root_name, doc = doc)

  apply(df, 1, function(row) {
    row <- as.list(row)
    row <- lapply(row, as.character)  # Convert all to character

    # Create Item node with attributes for Item Name and ID
    item <- newXMLNode(item_name, attrs = c(Name = row[["Item Name"]], ID = row[["Item ID"]]), parent = root)

    # Add main fields
    newXMLNode("Category", row[["Category"]], parent = item)
    newXMLNode("Brand", row[["Brand"]], parent = item)
    newXMLNode("Price", row[["Price"]], parent = item)

    # Store Variation ID as an attribute
    if (!is.null(row[["Variation ID"]]) && nzchar(row[["Variation ID"]])) {
      variation <- newXMLNode("Variation", attrs = c(ID = row[["Variation ID"]]), parent = item)
      newXMLNode("Details", row[["Variation Details"]], parent = variation)
    } else {
      # Even if Variation ID is missing, create a blank field
      newXMLNode("Variation", attrs = c(ID = ""), parent = item)
    }

    # Add Description
    newXMLNode("Description", row[["Description"]], parent = item)
  })

  return(doc)
}

xml_doc <- lab7_to_xml(lab7_inventory_data)

saveXML(xml_doc, file = "lab7_inventory.xml")
## [1] "lab7_inventory.xml"
cat(readLines("lab7_inventory.xml"), sep = "\n")
## <?xml version="1.0"?>
## <Inventory>
##   <Item Name="Smartphone" ID="101">
##     <Category>Electronics</Category>
##     <Brand>TechBrand</Brand>
##     <Price>699.99</Price>
##     <Variation ID="101-A">
##       <Details>Color: Black</Details>
##     </Variation>
##     <Description> Storage: 64GB</Description>
##   </Item>
##   <Item Name="Smartphone" ID="101">
##     <Category>Electronics</Category>
##     <Brand>TechBrand</Brand>
##     <Price>699.99</Price>
##     <Variation ID="101-B">
##       <Details>Color: White</Details>
##     </Variation>
##     <Description> Storage: 128GB</Description>
##   </Item>
##   <Item Name="Laptop" ID="102">
##     <Category>Electronics</Category>
##     <Brand>CompuBrand</Brand>
##     <Price>1099.99</Price>
##     <Variation ID="102-A">
##       <Details>Color: Silver</Details>
##     </Variation>
##     <Description> Storage: 256GB</Description>
##   </Item>
##   <Item Name="Laptop" ID="102">
##     <Category>Electronics</Category>
##     <Brand>CompuBrand</Brand>
##     <Price>1099.99</Price>
##     <Variation ID="102-B">
##       <Details>Color: Space Gray</Details>
##     </Variation>
##     <Description> Storage: 512GB</Description>
##   </Item>
##   <Item Name="Refrigerator" ID="201">
##     <Category>Home Appliances</Category>
##     <Brand>HomeCool</Brand>
##     <Price>899.99</Price>
##     <Variation ID="201-A">
##       <Details>Color: Stainless Steel</Details>
##     </Variation>
##     <Description>Capacity: 20 cu ft</Description>
##   </Item>
##   <Item Name="Refrigerator" ID="201">
##     <Category>Home Appliances</Category>
##     <Brand>HomeCool</Brand>
##     <Price>899.99</Price>
##     <Variation ID="201-B">
##       <Details>Color: White</Details>
##     </Variation>
##     <Description> Capacity: 18 cu ft</Description>
##   </Item>
##   <Item Name="Washing Machine" ID="202">
##     <Category>Home Appliances</Category>
##     <Brand>CleanTech</Brand>
##     <Price>499.99</Price>
##     <Variation ID="202-A">
##       <Details>Type: Front Load</Details>
##     </Variation>
##     <Description>Capacity: 4.5 cu ft</Description>
##   </Item>
##   <Item Name="Washing Machine" ID="202">
##     <Category>Home Appliances</Category>
##     <Brand>CleanTech</Brand>
##     <Price>499.99</Price>
##     <Variation ID="202-B">
##       <Details>Type: Top Load</Details>
##     </Variation>
##     <Description>Capacity: 5.0 cu ft</Description>
##   </Item>
##   <Item Name="T-Shirt" ID="301">
##     <Category>Clothing</Category>
##     <Brand>FashionCo</Brand>
##     <Price>19.99</Price>
##     <Variation ID="301-A">
##       <Details>Color: Blue</Details>
##     </Variation>
##     <Description> Size: S</Description>
##   </Item>
##   <Item Name="T-Shirt" ID="301">
##     <Category>Clothing</Category>
##     <Brand>FashionCo</Brand>
##     <Price>19.99</Price>
##     <Variation ID="301-B">
##       <Details>Color: Red</Details>
##     </Variation>
##     <Description> Size: M</Description>
##   </Item>
##   <Item Name="T-Shirt" ID="301">
##     <Category>Clothing</Category>
##     <Brand>FashionCo</Brand>
##     <Price>19.99</Price>
##     <Variation ID="301-C">
##       <Details>Color: Green</Details>
##     </Variation>
##     <Description> Size: L</Description>
##   </Item>
##   <Item Name="Jeans" ID="302">
##     <Category>Clothing</Category>
##     <Brand>DenimWorks</Brand>
##     <Price>49.99</Price>
##     <Variation ID="302-A">
##       <Details>Color: Dark Blue</Details>
##     </Variation>
##     <Description> Size: 32</Description>
##   </Item>
##   <Item Name="Jeans" ID="302">
##     <Category>Clothing</Category>
##     <Brand>DenimWorks</Brand>
##     <Price>49.99</Price>
##     <Variation ID="302-B">
##       <Details>Color: Light Blue</Details>
##     </Variation>
##     <Description> Size: 34</Description>
##   </Item>
##   <Item Name="Fiction Novel" ID="401">
##     <Category>Books</Category>
##     <Brand>-</Brand>
##     <Price>14.99</Price>
##     <Variation ID="401-A">
##       <Details>Format: Hardcover</Details>
##     </Variation>
##     <Description> Language: English</Description>
##   </Item>
##   <Item Name="Fiction Novel" ID="401">
##     <Category>Books</Category>
##     <Brand>-</Brand>
##     <Price>14.99</Price>
##     <Variation ID="401-B">
##       <Details>Format: Paperback</Details>
##     </Variation>
##     <Description> Language: Spanish</Description>
##   </Item>
##   <Item Name="Non-Fiction Guide" ID="402">
##     <Category>Books</Category>
##     <Brand>-</Brand>
##     <Price>24.99</Price>
##     <Variation ID="402-A">
##       <Details>Format: eBook</Details>
##     </Variation>
##     <Description> Language: English</Description>
##   </Item>
##   <Item Name="Non-Fiction Guide" ID="402">
##     <Category>Books</Category>
##     <Brand>-</Brand>
##     <Price>24.99</Price>
##     <Variation ID="402-B">
##       <Details>Format: Paperback</Details>
##     </Variation>
##     <Description> Language: French</Description>
##   </Item>
##   <Item Name="Basketball" ID="501">
##     <Category>Sports Equipment</Category>
##     <Brand>SportsGear</Brand>
##     <Price>29.99</Price>
##     <Variation ID="501-A">
##       <Details>Size: Size 7</Details>
##     </Variation>
##     <Description> Color: Orange</Description>
##   </Item>
##   <Item Name="Tennis Racket" ID="502">
##     <Category>Sports Equipment</Category>
##     <Brand>RacketPro</Brand>
##     <Price>89.99</Price>
##     <Variation ID="502-A">
##       <Details>Material: Graphite</Details>
##     </Variation>
##     <Description> Color: Black</Description>
##   </Item>
##   <Item Name="Tennis Racket" ID="502">
##     <Category>Sports Equipment</Category>
##     <Brand>RacketPro</Brand>
##     <Price>89.99</Price>
##     <Variation ID="502-B">
##       <Details>Material: Aluminum</Details>
##     </Variation>
##     <Description> Color: Silver</Description>
##   </Item>
## </Inventory>
# reading XML
xml_doc <- xmlParse("lab7_inventory.xml")

# Extract all Item nodes
items <- getNodeSet(xml_doc, "//Item")

# Convert XML to DataFrame (excluding attributes)
lab7_xml_copy <- xmlToDataFrame(nodes = items)

# Extract Variation IDs separately by parsing XML attributes
variation_ids <- sapply(items, function(x) {
  variation_node <- getNodeSet(x, "./Variation")  # Locate Variation node within each Item
  if (length(variation_node) > 0) {
    xmlGetAttr(variation_node[[1]], "ID", default = "")  # Extract "ID" attribute
  } else {
    ""
  }
})

# Add Variation ID column back to DataFrame
lab7_xml_copy$`Variation ID` <- variation_ids

# Print final DataFrame
head(lab7_xml_copy)
##          Category      Brand   Price              Variation         Description
## 1     Electronics  TechBrand  699.99           Color: Black       Storage: 64GB
## 2     Electronics  TechBrand  699.99           Color: White      Storage: 128GB
## 3     Electronics CompuBrand 1099.99          Color: Silver      Storage: 256GB
## 4     Electronics CompuBrand 1099.99      Color: Space Gray      Storage: 512GB
## 5 Home Appliances   HomeCool  899.99 Color: Stainless Steel  Capacity: 20 cu ft
## 6 Home Appliances   HomeCool  899.99           Color: White  Capacity: 18 cu ft
##   Variation ID
## 1        101-A
## 2        101-B
## 3        102-A
## 4        102-B
## 5        201-A
## 6        201-B
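The dataframe read back above has no Item Name or Item ID columns, because those values were written as attributes of each Item node and `xmlToDataFrame()` only reads child elements. A minimal sketch of recovering them, using the same `xmlGetAttr()` approach as for the Variation ID, is shown on a small hypothetical document (`xml_demo`) in the same shape as lab7_inventory.xml.

```r
library(XML)

# Hypothetical two-item document shaped like lab7_inventory.xml
xml_demo <- '<Inventory>
  <Item Name="Smartphone" ID="101"><Category>Electronics</Category></Item>
  <Item Name="Laptop" ID="102"><Category>Electronics</Category></Item>
</Inventory>'

doc_demo <- xmlParse(xml_demo, asText = TRUE)
items_demo <- getNodeSet(doc_demo, "//Item")

# Child elements become columns; attributes are not included
demo_df <- xmlToDataFrame(nodes = items_demo)

# Read the Item attributes separately and add them as columns
demo_df$`Item Name` <- sapply(items_demo, xmlGetAttr, "Name", default = NA_character_)
demo_df$`Item ID` <- sapply(items_demo, xmlGetAttr, "ID", default = NA_character_)
```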

Pros and cons of XML format:

Creating JSON file format

This code creates lab7_inventory.json from the dataframe. After that, it converts the file back into a dataframe.

library(jsonlite)
lab7_inventory_json <- toJSON(lab7_inventory_data)
write(lab7_inventory_json, "lab7_inventory.json")
cat(readLines("lab7_inventory.json"), sep = "\n")
## [{"Category":"Electronics","Item Name":"Smartphone","Item ID":"101","Brand":"TechBrand","Price":"699.99","Variation ID":"101-A","Variation Details":"Color: Black","Description":" Storage: 64GB"},{"Category":"Electronics","Item Name":"Smartphone","Item ID":"101","Brand":"TechBrand","Price":"699.99","Variation ID":"101-B","Variation Details":"Color: White","Description":" Storage: 128GB"},{"Category":"Electronics","Item Name":"Laptop","Item ID":"102","Brand":"CompuBrand","Price":"1099.99","Variation ID":"102-A","Variation Details":"Color: Silver","Description":" Storage: 256GB"},{"Category":"Electronics","Item Name":"Laptop","Item ID":"102","Brand":"CompuBrand","Price":"1099.99","Variation ID":"102-B","Variation Details":"Color: Space Gray","Description":" Storage: 512GB"},{"Category":"Home Appliances","Item Name":"Refrigerator","Item ID":"201","Brand":"HomeCool","Price":"899.99","Variation ID":"201-A","Variation Details":"Color: Stainless Steel","Description":"Capacity: 20 cu ft"},{"Category":"Home Appliances","Item Name":"Refrigerator","Item ID":"201","Brand":"HomeCool","Price":"899.99","Variation ID":"201-B","Variation Details":"Color: White","Description":" Capacity: 18 cu ft"},{"Category":"Home Appliances","Item Name":"Washing Machine","Item ID":"202","Brand":"CleanTech","Price":"499.99","Variation ID":"202-A","Variation Details":"Type: Front Load","Description":"Capacity: 4.5 cu ft"},{"Category":"Home Appliances","Item Name":"Washing Machine","Item ID":"202","Brand":"CleanTech","Price":"499.99","Variation ID":"202-B","Variation Details":"Type: Top Load","Description":"Capacity: 5.0 cu ft"},{"Category":"Clothing","Item Name":"T-Shirt","Item ID":"301","Brand":"FashionCo","Price":"19.99","Variation ID":"301-A","Variation Details":"Color: Blue","Description":" Size: S"},{"Category":"Clothing","Item Name":"T-Shirt","Item ID":"301","Brand":"FashionCo","Price":"19.99","Variation ID":"301-B","Variation Details":"Color: Red","Description":" Size: M"},{"Category":"Clothing","Item Name":"T-Shirt","Item ID":"301","Brand":"FashionCo","Price":"19.99","Variation ID":"301-C","Variation Details":"Color: Green","Description":" Size: L"},{"Category":"Clothing","Item Name":"Jeans","Item ID":"302","Brand":"DenimWorks","Price":"49.99","Variation ID":"302-A","Variation Details":"Color: Dark Blue","Description":" Size: 32"},{"Category":"Clothing","Item Name":"Jeans","Item ID":"302","Brand":"DenimWorks","Price":"49.99","Variation ID":"302-B","Variation Details":"Color: Light Blue","Description":" Size: 34"},{"Category":"Books","Item Name":"Fiction Novel","Item ID":"401","Brand":"-","Price":"14.99","Variation ID":"401-A","Variation Details":"Format: Hardcover","Description":" Language: English"},{"Category":"Books","Item Name":"Fiction Novel","Item ID":"401","Brand":"-","Price":"14.99","Variation ID":"401-B","Variation Details":"Format: Paperback","Description":" Language: Spanish"},{"Category":"Books","Item Name":"Non-Fiction Guide","Item ID":"402","Brand":"-","Price":"24.99","Variation ID":"402-A","Variation Details":"Format: eBook","Description":" Language: English"},{"Category":"Books","Item Name":"Non-Fiction Guide","Item ID":"402","Brand":"-","Price":"24.99","Variation ID":"402-B","Variation Details":"Format: Paperback","Description":" Language: French"},{"Category":"Sports Equipment","Item Name":"Basketball","Item ID":"501","Brand":"SportsGear","Price":"29.99","Variation ID":"501-A","Variation Details":"Size: Size 7","Description":" Color: Orange"},{"Category":"Sports Equipment","Item Name":"Tennis Racket","Item ID":"502","Brand":"RacketPro","Price":"89.99","Variation ID":"502-A","Variation Details":"Material: Graphite","Description":" Color: Black"},{"Category":"Sports Equipment","Item Name":"Tennis Racket","Item ID":"502","Brand":"RacketPro","Price":"89.99","Variation ID":"502-B","Variation Details":"Material: Aluminum","Description":" Color: Silver"}]
json_data <- fromJSON("lab7_inventory.json")
lab7_JSON_copy <- as.data.frame(json_data)
head(lab7_JSON_copy)
##          Category    Item Name Item ID      Brand   Price Variation ID
## 1     Electronics   Smartphone     101  TechBrand  699.99        101-A
## 2     Electronics   Smartphone     101  TechBrand  699.99        101-B
## 3     Electronics       Laptop     102 CompuBrand 1099.99        102-A
## 4     Electronics       Laptop     102 CompuBrand 1099.99        102-B
## 5 Home Appliances Refrigerator     201   HomeCool  899.99        201-A
## 6 Home Appliances Refrigerator     201   HomeCool  899.99        201-B
##        Variation Details         Description
## 1           Color: Black       Storage: 64GB
## 2           Color: White      Storage: 128GB
## 3          Color: Silver      Storage: 256GB
## 4      Color: Space Gray      Storage: 512GB
## 5 Color: Stainless Steel  Capacity: 20 cu ft
## 6           Color: White  Capacity: 18 cu ft
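In the JSON above every value is quoted, including Price, because the source columns are character. As a rough sketch, converting a column to numeric before calling `toJSON()` makes it a JSON number, and `fromJSON()` then returns it as numeric. The dataframe `inventory_demo` is a hypothetical stand-in, and `pretty = TRUE` is only there to make the text easier to read.

```r
library(jsonlite)

# Hypothetical two-row stand-in for lab7_inventory_data
inventory_demo <- data.frame(
  `Item Name` = c("Smartphone", "Laptop"),
  Price = c("699.99", "1099.99"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Convert before serializing so Price is written without quotes
inventory_demo$Price <- as.numeric(inventory_demo$Price)

# pretty = TRUE puts each field on its own line
json_demo <- toJSON(inventory_demo, pretty = TRUE)

# Reading the text back keeps Price numeric
back_demo <- fromJSON(json_demo)
```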

Pros and cons of JSON format:

Creating HTML format file

Here the dataframe is converted to lab7_inventory.html. After that, the code converts the HTML file back into a dataframe.

html_table <- kable(lab7_inventory_data, format = "html", escape = FALSE)

html_output <- paste0("<html><head><title>Inventory</title></head><body>", 
                      html_table, 
                      "</body></html>")

writeLines(html_output, "lab7_inventory.html")
cat(readLines("lab7_inventory.html"), sep = "\n")
## <html><head><title>Inventory</title></head><body><table>
##  <thead>
##   <tr>
##    <th style="text-align:left;">   </th>
##    <th style="text-align:left;"> Category </th>
##    <th style="text-align:left;"> Item Name </th>
##    <th style="text-align:left;"> Item ID </th>
##    <th style="text-align:left;"> Brand </th>
##    <th style="text-align:left;"> Price </th>
##    <th style="text-align:left;"> Variation ID </th>
##    <th style="text-align:left;"> Variation Details </th>
##    <th style="text-align:left;"> Description </th>
##   </tr>
##  </thead>
## <tbody>
##   <tr>
##    <td style="text-align:left;"> 2 </td>
##    <td style="text-align:left;"> Electronics </td>
##    <td style="text-align:left;"> Smartphone </td>
##    <td style="text-align:left;"> 101 </td>
##    <td style="text-align:left;"> TechBrand </td>
##    <td style="text-align:left;"> 699.99 </td>
##    <td style="text-align:left;"> 101-A </td>
##    <td style="text-align:left;"> Color: Black </td>
##    <td style="text-align:left;"> Storage: 64GB </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 3 </td>
##    <td style="text-align:left;"> Electronics </td>
##    <td style="text-align:left;"> Smartphone </td>
##    <td style="text-align:left;"> 101 </td>
##    <td style="text-align:left;"> TechBrand </td>
##    <td style="text-align:left;"> 699.99 </td>
##    <td style="text-align:left;"> 101-B </td>
##    <td style="text-align:left;"> Color: White </td>
##    <td style="text-align:left;"> Storage: 128GB </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 4 </td>
##    <td style="text-align:left;"> Electronics </td>
##    <td style="text-align:left;"> Laptop </td>
##    <td style="text-align:left;"> 102 </td>
##    <td style="text-align:left;"> CompuBrand </td>
##    <td style="text-align:left;"> 1099.99 </td>
##    <td style="text-align:left;"> 102-A </td>
##    <td style="text-align:left;"> Color: Silver </td>
##    <td style="text-align:left;"> Storage: 256GB </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 5 </td>
##    <td style="text-align:left;"> Electronics </td>
##    <td style="text-align:left;"> Laptop </td>
##    <td style="text-align:left;"> 102 </td>
##    <td style="text-align:left;"> CompuBrand </td>
##    <td style="text-align:left;"> 1099.99 </td>
##    <td style="text-align:left;"> 102-B </td>
##    <td style="text-align:left;"> Color: Space Gray </td>
##    <td style="text-align:left;"> Storage: 512GB </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 6 </td>
##    <td style="text-align:left;"> Home Appliances </td>
##    <td style="text-align:left;"> Refrigerator </td>
##    <td style="text-align:left;"> 201 </td>
##    <td style="text-align:left;"> HomeCool </td>
##    <td style="text-align:left;"> 899.99 </td>
##    <td style="text-align:left;"> 201-A </td>
##    <td style="text-align:left;"> Color: Stainless Steel </td>
##    <td style="text-align:left;"> Capacity: 20 cu ft </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 8 </td>
##    <td style="text-align:left;"> Home Appliances </td>
##    <td style="text-align:left;"> Refrigerator </td>
##    <td style="text-align:left;"> 201 </td>
##    <td style="text-align:left;"> HomeCool </td>
##    <td style="text-align:left;"> 899.99 </td>
##    <td style="text-align:left;"> 201-B </td>
##    <td style="text-align:left;"> Color: White </td>
##    <td style="text-align:left;"> Capacity: 18 cu ft </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 9 </td>
##    <td style="text-align:left;"> Home Appliances </td>
##    <td style="text-align:left;"> Washing Machine </td>
##    <td style="text-align:left;"> 202 </td>
##    <td style="text-align:left;"> CleanTech </td>
##    <td style="text-align:left;"> 499.99 </td>
##    <td style="text-align:left;"> 202-A </td>
##    <td style="text-align:left;"> Type: Front Load </td>
##    <td style="text-align:left;"> Capacity: 4.5 cu ft </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 11 </td>
##    <td style="text-align:left;"> Home Appliances </td>
##    <td style="text-align:left;"> Washing Machine </td>
##    <td style="text-align:left;"> 202 </td>
##    <td style="text-align:left;"> CleanTech </td>
##    <td style="text-align:left;"> 499.99 </td>
##    <td style="text-align:left;"> 202-B </td>
##    <td style="text-align:left;"> Type: Top Load </td>
##    <td style="text-align:left;"> Capacity: 5.0 cu ft </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 13 </td>
##    <td style="text-align:left;"> Clothing </td>
##    <td style="text-align:left;"> T-Shirt </td>
##    <td style="text-align:left;"> 301 </td>
##    <td style="text-align:left;"> FashionCo </td>
##    <td style="text-align:left;"> 19.99 </td>
##    <td style="text-align:left;"> 301-A </td>
##    <td style="text-align:left;"> Color: Blue </td>
##    <td style="text-align:left;"> Size: S </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 14 </td>
##    <td style="text-align:left;"> Clothing </td>
##    <td style="text-align:left;"> T-Shirt </td>
##    <td style="text-align:left;"> 301 </td>
##    <td style="text-align:left;"> FashionCo </td>
##    <td style="text-align:left;"> 19.99 </td>
##    <td style="text-align:left;"> 301-B </td>
##    <td style="text-align:left;"> Color: Red </td>
##    <td style="text-align:left;"> Size: M </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 15 </td>
##    <td style="text-align:left;"> Clothing </td>
##    <td style="text-align:left;"> T-Shirt </td>
##    <td style="text-align:left;"> 301 </td>
##    <td style="text-align:left;"> FashionCo </td>
##    <td style="text-align:left;"> 19.99 </td>
##    <td style="text-align:left;"> 301-C </td>
##    <td style="text-align:left;"> Color: Green </td>
##    <td style="text-align:left;"> Size: L </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 16 </td>
##    <td style="text-align:left;"> Clothing </td>
##    <td style="text-align:left;"> Jeans </td>
##    <td style="text-align:left;"> 302 </td>
##    <td style="text-align:left;"> DenimWorks </td>
##    <td style="text-align:left;"> 49.99 </td>
##    <td style="text-align:left;"> 302-A </td>
##    <td style="text-align:left;"> Color: Dark Blue </td>
##    <td style="text-align:left;"> Size: 32 </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 17 </td>
##    <td style="text-align:left;"> Clothing </td>
##    <td style="text-align:left;"> Jeans </td>
##    <td style="text-align:left;"> 302 </td>
##    <td style="text-align:left;"> DenimWorks </td>
##    <td style="text-align:left;"> 49.99 </td>
##    <td style="text-align:left;"> 302-B </td>
##    <td style="text-align:left;"> Color: Light Blue </td>
##    <td style="text-align:left;"> Size: 34 </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 18 </td>
##    <td style="text-align:left;"> Books </td>
##    <td style="text-align:left;"> Fiction Novel </td>
##    <td style="text-align:left;"> 401 </td>
##    <td style="text-align:left;"> - </td>
##    <td style="text-align:left;"> 14.99 </td>
##    <td style="text-align:left;"> 401-A </td>
##    <td style="text-align:left;"> Format: Hardcover </td>
##    <td style="text-align:left;"> Language: English </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 19 </td>
##    <td style="text-align:left;"> Books </td>
##    <td style="text-align:left;"> Fiction Novel </td>
##    <td style="text-align:left;"> 401 </td>
##    <td style="text-align:left;"> - </td>
##    <td style="text-align:left;"> 14.99 </td>
##    <td style="text-align:left;"> 401-B </td>
##    <td style="text-align:left;"> Format: Paperback </td>
##    <td style="text-align:left;"> Language: Spanish </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 20 </td>
##    <td style="text-align:left;"> Books </td>
##    <td style="text-align:left;"> Non-Fiction Guide </td>
##    <td style="text-align:left;"> 402 </td>
##    <td style="text-align:left;"> - </td>
##    <td style="text-align:left;"> 24.99 </td>
##    <td style="text-align:left;"> 402-A </td>
##    <td style="text-align:left;"> Format: eBook </td>
##    <td style="text-align:left;"> Language: English </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 21 </td>
##    <td style="text-align:left;"> Books </td>
##    <td style="text-align:left;"> Non-Fiction Guide </td>
##    <td style="text-align:left;"> 402 </td>
##    <td style="text-align:left;"> - </td>
##    <td style="text-align:left;"> 24.99 </td>
##    <td style="text-align:left;"> 402-B </td>
##    <td style="text-align:left;"> Format: Paperback </td>
##    <td style="text-align:left;"> Language: French </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 22 </td>
##    <td style="text-align:left;"> Sports Equipment </td>
##    <td style="text-align:left;"> Basketball </td>
##    <td style="text-align:left;"> 501 </td>
##    <td style="text-align:left;"> SportsGear </td>
##    <td style="text-align:left;"> 29.99 </td>
##    <td style="text-align:left;"> 501-A </td>
##    <td style="text-align:left;"> Size: Size 7 </td>
##    <td style="text-align:left;"> Color: Orange </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 23 </td>
##    <td style="text-align:left;"> Sports Equipment </td>
##    <td style="text-align:left;"> Tennis Racket </td>
##    <td style="text-align:left;"> 502 </td>
##    <td style="text-align:left;"> RacketPro </td>
##    <td style="text-align:left;"> 89.99 </td>
##    <td style="text-align:left;"> 502-A </td>
##    <td style="text-align:left;"> Material: Graphite </td>
##    <td style="text-align:left;"> Color: Black </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> 24 </td>
##    <td style="text-align:left;"> Sports Equipment </td>
##    <td style="text-align:left;"> Tennis Racket </td>
##    <td style="text-align:left;"> 502 </td>
##    <td style="text-align:left;"> RacketPro </td>
##    <td style="text-align:left;"> 89.99 </td>
##    <td style="text-align:left;"> 502-B </td>
##    <td style="text-align:left;"> Material: Aluminum </td>
##    <td style="text-align:left;"> Color: Silver </td>
##   </tr>
## </tbody>
## </table></body></html>
html_file <- read_html("lab7_inventory.html")
tables <- html_file |>
  html_table(fill = TRUE)
lab7_HTML_copy <- tables[[1]]
head(lab7_HTML_copy)
## # A tibble: 6 × 9
##      `` Category        `Item Name`  `Item ID` Brand      Price `Variation ID`
##   <int> <chr>           <chr>            <int> <chr>      <dbl> <chr>         
## 1     2 Electronics     Smartphone         101 TechBrand   700. 101-A         
## 2     3 Electronics     Smartphone         101 TechBrand   700. 101-B         
## 3     4 Electronics     Laptop             102 CompuBrand 1100. 102-A         
## 4     5 Electronics     Laptop             102 CompuBrand 1100. 102-B         
## 5     6 Home Appliances Refrigerator       201 HomeCool    900. 201-A         
## 6     8 Home Appliances Refrigerator       201 HomeCool    900. 201-B         
## # ℹ 2 more variables: `Variation Details` <chr>, Description <chr>
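Two details of the table read back above are worth noting. The unnamed first column holds the dataframe's row names, which `kable()` wrote into the HTML. Values such as `700.` are only the tibble's rounded display; the stored number is still 699.99. A minimal sketch of writing the table without row names follows; `inventory_demo` is hypothetical, and reading the HTML from a string instead of a file is a shortcut for the example.

```r
library(knitr)
library(rvest)

# Hypothetical stand-in with gapped row names, as in lab7_inventory_data
inventory_demo <- data.frame(
  Category = c("Electronics", "Books"),
  Price = c("699.99", "14.99"),
  stringsAsFactors = FALSE
)
rownames(inventory_demo) <- c("2", "18")

# row.names = FALSE keeps the row names out of the HTML table
html_demo <- paste0(
  "<html><head><title>Inventory</title></head><body>",
  kable(inventory_demo, format = "html", row.names = FALSE),
  "</body></html>"
)

# html_table() returns one tibble per <table>; take the first
back_demo <- html_table(read_html(html_demo))[[1]]
```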

Pros and cons of HTML format:

Creating PARQUET format file

This code creates lab_7inventory.parquet from the dataframe. After that, it converts the Parquet file back into a dataframe.

write_parquet(lab7_inventory_data, "lab_7inventory.parquet")
raw_data <- readBin("lab_7inventory.parquet", what = "raw", n = 100)

# Print first 100 bytes
print(raw_data)
##   [1] 50 41 52 31 15 04 15 96 01 15 9c 01 4c 15 0a 15 00 12 00 00 4b f0 4a 0b 00
##  [26] 00 00 45 6c 65 63 74 72 6f 6e 69 63 73 0f 00 00 00 48 6f 6d 65 20 41 70 70
##  [51] 6c 69 61 6e 63 65 73 08 00 00 00 43 6c 6f 74 68 69 6e 67 05 00 00 00 42 6f
##  [76] 6f 6b 73 10 00 00 00 53 70 6f 72 74 73 20 45 71 75 69 70 6d 65 6e 74 15 00
df_parquet <- read_parquet("lab_7inventory.parquet")

head(df_parquet)
## # A tibble: 6 × 8
##   Category  `Item Name` `Item ID` Brand Price `Variation ID` `Variation Details`
##   <chr>     <chr>       <chr>     <chr> <chr> <chr>          <chr>              
## 1 Electron… Smartphone  101       Tech… 699.… 101-A          Color: Black       
## 2 Electron… Smartphone  101       Tech… 699.… 101-B          Color: White       
## 3 Electron… Laptop      102       Comp… 1099… 102-A          Color: Silver      
## 4 Electron… Laptop      102       Comp… 1099… 102-B          Color: Space Gray  
## 5 Home App… Refrigerat… 201       Home… 899.… 201-A          Color: Stainless S…
## 6 Home App… Refrigerat… 201       Home… 899.… 201-B          Color: White       
## # ℹ 1 more variable: Description <chr>
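Every column above comes back as `<chr>` because that is how the dataframe was stored, not because of Parquet. The sketch below, on a hypothetical typed dataframe, suggests how Parquet keeps column types across a round trip and lets a subset of columns be read on its own. The temporary file path is an assumption made for the example.

```r
library(arrow)

# Hypothetical stand-in with typed columns
inventory_demo <- data.frame(
  Category = c("Electronics", "Books"),
  `Item ID` = c(101L, 401L),
  Price = c(699.99, 14.99),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Write to a temporary file so the example leaves nothing behind
parquet_path <- tempfile(fileext = ".parquet")
write_parquet(inventory_demo, parquet_path)

# Column types are stored in the file and come back unchanged
full_demo <- read_parquet(parquet_path)

# Parquet is columnar, so only the requested columns are read
prices_demo <- read_parquet(parquet_path, col_select = c("Category", "Price"))
```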

Pros and cons of PARQUET format: