Preparing different file formats

In this assignment, we were asked to prepare a dataset by formatting it into four different file formats (JSON, HTML, XML, and Parquet) and then reading each back into R. I began by creating an HTML table from the data that was provided, then imported it using readHTMLTable from the XML package. The package did a good job of converting the HTML table elements into a dataframe, using the values in row 1 as the column names. However, readHTMLTable brings in every column as character, so the columns need to be manually reassigned to their proper types, e.g. with transform(df, col = as.integer(col)).

library(XML)    # readHTMLTable, xmlToDataFrame, newXMLNode, saveXML
library(rlist)  # list.clean
library(dplyr)  # glimpse, rename_with, mutate

html_file <- "cuny-mart-data.html"

html_df <- readHTMLTable(html_file)
html_df <- list.clean(html_df, fun = is.null, recursive = FALSE)
names(html_df)
## [1] "products"
html_products <- html_df$products
glimpse(html_products)
## Rows: 20
## Columns: 8
## $ Category       <chr> "Electronics", "Electronics", "Electronics", "Electroni…
## $ `Item Name`    <chr> "Smartphone", "Smartphone", "Laptop", "Laptop", "Refrig…
## $ `Item ID`      <chr> "101", "101", "102", "102", "201", "201", "202", "202",…
## $ Brand          <chr> "TechBrand", "TechBrand", "CompuBrand", "CompuBrand", "…
## $ Price          <chr> "699.99", "699.99", "1099.99", "1099.99", "899.99", "89…
## $ `Variation ID` <chr> "101-A", "101-B", "102-A", "102-B", "201-A", "201-B", "…
## $ Variation      <chr> "Color: Black", "Color: White", "Color: Silver", "Color…
## $ Details        <chr> "Storage: 64GB", "Storage: 128GB", "Storage: 256GB", "S…
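
As a quick illustration of that type fix (a sketch only; the full conversion is done in the next section, and dplyr::mutate is used here because base transform would mangle the spaced column names):

html_products |>
  dplyr::mutate(`Item ID` = as.integer(`Item ID`),  # char -> integer
                Price = as.numeric(Price))          # char -> double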

Exporting as Different Formats

XML

Once we have the HTML data in a dataframe, we can export it to the other requested formats. Before exporting, I normalized the column names (lowercased, with spaces and periods replaced by underscores) and converted item_id and price to their proper types. First I exported the data in XML format using the XML package: I created a parent node “products”, then used apply to generate a child node “product” for each row, with the row’s fields as grandchildren nodes. Finally, we write the file to the working directory.

df <- html_products |>
  dplyr::rename_with(~ tolower(gsub("\\s+|\\.", "_", .x))) |>  # clean up column names
  transform(item_id = as.integer(item_id),
            price = as.numeric(price))                         # restore variable types

# create a new xml doc
doc_xml <- newXMLDoc(isHTML = FALSE)

# create a parent node
table_node <- newXMLNode("products", doc = doc_xml)

# build one <product> node per dataframe row, with each field as its own child node
row_data <- apply(df, 1, function(x) {
  z1 <- newXMLNode('product') # create a new node for each row
  addChildren(z1, lapply(names(x), function(y) newXMLNode(y, x[y])))
})

# add row data to table node
xmlParent(row_data) <- table_node

# save as xml file
saveXML(doc_xml, file = "df.xml")
## [1] "df.xml"

Reading from XML

To read the file back in, I used the xmlToDataFrame function from the XML package. As with the HTML example, the XML was converted into a dataframe ready for manipulation. Using glimpse we can see that the XML round trip failed to preserve the integer type of the item_id column and the double type of the price column. Note also the stray leading spaces in price (e.g. " 699.99"): an artifact of apply() coercing the dataframe to a padded character matrix during the export.

xml_file <- "df.xml"
xml_df <- xmlToDataFrame(xml_file)

glimpse(xml_df)
## Rows: 20
## Columns: 8
## $ category     <chr> "Electronics", "Electronics", "Electronics", "Electronics…
## $ item_name    <chr> "Smartphone", "Smartphone", "Laptop", "Laptop", "Refriger…
## $ item_id      <chr> "101", "101", "102", "102", "201", "201", "202", "202", "…
## $ brand        <chr> "TechBrand", "TechBrand", "CompuBrand", "CompuBrand", "Ho…
## $ price        <chr> " 699.99", " 699.99", "1099.99", "1099.99", " 899.99", " …
## $ variation_id <chr> "101-A", "101-B", "102-A", "102-B", "201-A", "201-B", "20…
## $ variation    <chr> "Color: Black", "Color: White", "Color: Silver", "Color: …
## $ details      <chr> "Storage: 64GB", "Storage: 128GB", "Storage: 256GB", "Sto…
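
If the types matter downstream, they can be re-inferred after import. One option is utils::type.convert (a sketch, assuming a recent R where type.convert has a data.frame method; it guesses each column's type and leaves mixed strings like "101-A" as character):

xml_df <- type.convert(xml_df, as.is = TRUE)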

Exporting as JSON

Using the same dataframe as before, I used the write_json function from the jsonlite package to export the data as a JSON file.

jsonlite::write_json(df, "df.json")
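
To preview what the serialized records look like before writing, toJSON with pretty = TRUE gives a readable dump (a quick sketch):

jsonlite::toJSON(head(df, 2), pretty = TRUE)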

Reading from JSON

I used the fromJSON function from the jsonlite package to read the data from JSON back into a dataframe. Unlike the previous two examples, the JSON round trip preserved the integer type of the item_id column and the double type of the price column, since JSON distinguishes numbers from strings.

library(jsonlite)  # fromJSON

json_file <- "df.json"
df_products <- fromJSON(json_file, flatten = TRUE)
glimpse(df_products)
## Rows: 20
## Columns: 8
## $ category     <chr> "Electronics", "Electronics", "Electronics", "Electronics…
## $ item_name    <chr> "Smartphone", "Smartphone", "Laptop", "Laptop", "Refriger…
## $ item_id      <int> 101, 101, 102, 102, 201, 201, 202, 202, 301, 301, 301, 30…
## $ brand        <chr> "TechBrand", "TechBrand", "CompuBrand", "CompuBrand", "Ho…
## $ price        <dbl> 699.99, 699.99, 1099.99, 1099.99, 899.99, 899.99, 499.99,…
## $ variation_id <chr> "101-A", "101-B", "102-A", "102-B", "201-A", "201-B", "20…
## $ variation    <chr> "Color: Black", "Color: White", "Color: Silver", "Color: …
## $ details      <chr> "Storage: 64GB", "Storage: 128GB", "Storage: 256GB", "Sto…
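
A quick way to confirm the round trip was lossless (a sketch; all.equal reports any value or type differences):

all.equal(df, df_products)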

Exporting as Parquet

I used the write_parquet function from the arrow package to export my data to a parquet file as shown below:

library(arrow)  # write_parquet, read_parquet

write_parquet(df, "df.parquet")
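
If file size becomes a concern, write_parquet also accepts a compression argument (a sketch with a hypothetical output file; the available codecs depend on how the arrow build was compiled):

write_parquet(df, "df_gzip.parquet", compression = "gzip")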

Reading from Parquet

I used the read_parquet function from the arrow package to import my data from a parquet file into a dataframe. As with the JSON example, the parquet round trip preserved my variable types.

df_parquet <- read_parquet("df.parquet") 

glimpse(df_parquet)
## Rows: 20
## Columns: 8
## $ category     <chr> "Electronics", "Electronics", "Electronics", "Electronics…
## $ item_name    <chr> "Smartphone", "Smartphone", "Laptop", "Laptop", "Refriger…
## $ item_id      <int> 101, 101, 102, 102, 201, 201, 202, 202, 301, 301, 301, 30…
## $ brand        <chr> "TechBrand", "TechBrand", "CompuBrand", "CompuBrand", "Ho…
## $ price        <dbl> 699.99, 699.99, 1099.99, 1099.99, 899.99, 899.99, 499.99,…
## $ variation_id <chr> "101-A", "101-B", "102-A", "102-B", "201-A", "201-B", "20…
## $ variation    <chr> "Color: Black", "Color: White", "Color: Silver", "Color: …
## $ details      <chr> "Storage: 64GB", "Storage: 128GB", "Storage: 256GB", "Sto…
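
Parquet preserves types because it stores a typed schema inside the file; that schema can be inspected by reading the file as an Arrow Table instead of a dataframe (a sketch):

pq <- read_parquet("df.parquet", as_data_frame = FALSE)
pq$schema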

Prepping data for analysis

Now that we have the data in a dataframe, we can proceed with cleaning it up and transforming it into tidy format. We can extract the variation letter from variation_id, pull the product-level information into its own dataframe, and pivot the variation attributes into a long key-value table.

library(tidyr)    # extract, pivot_longer, separate
library(stringr)  # str_trim

# keep only the trailing variation letter (e.g. "101-A" -> "A")
df2 <- df |>
  extract(
    col = "variation_id",
    into = "var_id",
    regex = "([a-zA-Z])$"
  )

# pull out product info into its own dataframe
prods_df <- subset(df2, select = c(item_id, var_id, item_name, brand, price, category))

head(prods_df, n=2)
##   item_id var_id  item_name     brand  price    category
## 1     101      A Smartphone TechBrand 699.99 Electronics
## 2     101      B Smartphone TechBrand 699.99 Electronics
vars_df <- df2 |>
  # stack the variation and details columns into key-value rows
  pivot_longer(
    cols = c(variation, details),
    names_to = "temp_variable",
    values_to = "temp_value"
  ) |>
  # split e.g. "Color: Black" into variable = "Color", value = " Black"
  separate(
    col = "temp_value",
    into = c("variable", "value"),
    sep = ":"
  ) |>
  dplyr::mutate(value = str_trim(value)) |>
  subset(select = c(item_id, var_id, category, variable, value))

head(vars_df, n = 10)
## # A tibble: 10 × 5
##    item_id var_id category        variable value          
##      <int> <chr>  <chr>           <chr>    <chr>          
##  1     101 A      Electronics     Color    Black          
##  2     101 A      Electronics     Storage  64GB           
##  3     101 B      Electronics     Color    White          
##  4     101 B      Electronics     Storage  128GB          
##  5     102 A      Electronics     Color    Silver         
##  6     102 A      Electronics     Storage  256GB          
##  7     102 B      Electronics     Color    Space Gray     
##  8     102 B      Electronics     Storage  512GB          
##  9     201 A      Home Appliances Color    Stainless Steel
## 10     201 A      Home Appliances Capacity 20 cu ft

Once we have it in this format, we can easily group, summarize, and build our analysis, as sketched below.
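
For instance (a sketch of the kind of summary this layout enables; distinct() collapses the per-variation duplicates before averaging):

prods_df |>
  dplyr::distinct(item_id, item_name, brand, price, category) |>
  dplyr::group_by(category) |>
  dplyr::summarise(n_items = dplyr::n(), avg_price = mean(price))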

Conclusion

All four formats have R packages that can move data between R dataframes and their respective file formats. However, JSON and Parquet proved somewhat more effective, as they retained the variable types for item_id and price. Comparing file sizes is difficult for our dataset, as all the files are roughly the same size: the JSON and HTML files weigh in at about 4 KB, the parquet file at 5 KB, and the XML file at 6 KB. This comparison is misleading, though, as our HTML file contains only the table with our data, without the additional markup and content one would encounter on a real website, which would inflate its size. I expect that as the dataset grows, the differences would become more noticeable, with JSON and parquet having the smallest sizes and being the most performant.
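
The sizes quoted above can be verified directly in R (a quick sketch):

files <- c("cuny-mart-data.html", "df.xml", "df.json", "df.parquet")
data.frame(file = files, bytes = file.size(files))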