Overview

This assignment is for working with JSON, HTML, XML, and Parquet in R.

Load the Libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(repurrrsive)
library(jsonlite)

## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:purrr':
## 
##     flatten

library(xml2)
library(XML)
library(rvest)

## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(arrow)

## 
## Attaching package: 'arrow'
## 
## The following object is masked from 'package:lubridate':
## 
##     duration
## 
## The following object is masked from 'package:utils':
## 
##     timestamp

Read the Original Data

I first pasted the text from the PDF into a CSV, and imported the original data via the usual read_csv:

df <- read_csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-607/refs/heads/main/assignment_7_text.csv")

## Rows: 20 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Category,Item Name,Item ID,Brand,Price,Variation ID,Variation Details
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(df, 5)

# Separate into columns
cunymart <- df %>% 
  separate_wider_delim(`Category,Item Name,Item ID,Brand,Price,Variation ID,Variation Details`, delim = ",", names = c("Category", "Item Name", "Item ID", "Brand", "Price", "Variation ID", "Variation Details"), too_many = "merge")

head(cunymart, 5)

This data frame now matches the original data in the PDF. This data frame is tidy since each row represents a Variation ID for a particular item. If any new item is created, then a new row can be added. Each feature of the item is now a column.

I would probably separate the Variation Details into 2 separate columns to make it easier to read, but for the purpose of this assignment I won’t alter the original data.

Now let’s convert the data frame into the different file types.

Data Frame Conversion

For each file type, I save the new converted file to the Downloads folder of the person running the markdown file. Note: I then use this file type directly for the importing section later on.

I go into the pros and cons of each file type in the Conclusions section.

Let’s convert the data frame:

JSON:

# Convert to JSON
print(toJSON(as.data.frame(cunymart)))

## [{"Category":"Electronics","Item Name":"Smartphone","Item ID":"101","Brand":"TechBrand","Price":"699.99","Variation ID":"101-A","Variation Details":"Color: Black, Storage: 64GB"},{"Category":"Electronics","Item Name":"Smartphone","Item ID":"101","Brand":"TechBrand","Price":"699.99","Variation ID":"101-B","Variation Details":"Color: White, Storage: 128GB"},{"Category":"Electronics","Item Name":"Laptop","Item ID":"102","Brand":"CompuBrand","Price":"1099.99","Variation ID":"102-A","Variation Details":"Color: Silver, Storage: 256GB"},{"Category":"Electronics","Item Name":"Laptop","Item ID":"102","Brand":"CompuBrand","Price":"1099.99","Variation ID":"102-B","Variation Details":"Color: Space Gray, Storage: 512GB"},{"Category":"Home Appliances","Item Name":"Refrigerator","Item ID":"201","Brand":"HomeCool","Price":"899.99","Variation ID":"201-A","Variation Details":"Color: Stainless Steel, Capacity: 20 cu ft"},{"Category":"Home Appliances","Item Name":"Refrigerator","Item ID":"201","Brand":"HomeCool","Price":"899.99","Variation ID":"201-B","Variation Details":"Color: White, Capacity: 18 cu ft"},{"Category":"Home Appliances","Item Name":"Washing Machine","Item ID":"202","Brand":"CleanTech","Price":"499.99","Variation ID":"202-A","Variation Details":"Type: Front Load, Capacity: 4.5 cu ft"},{"Category":"Home Appliances","Item Name":"Washing Machine","Item ID":"202","Brand":"CleanTech","Price":"499.99","Variation ID":"202-B","Variation Details":"Type: Top Load, Capacity: 5.0 cu ft"},{"Category":"Clothing","Item Name":"T-Shirt","Item ID":"301","Brand":"FashionCo","Price":"19.99","Variation ID":"301-A","Variation Details":"Color: Blue, Size: S"},{"Category":"Clothing","Item Name":"T-Shirt","Item ID":"301","Brand":"FashionCo","Price":"19.99","Variation ID":"301-B","Variation Details":"Color: Red, Size: M"},{"Category":"Clothing","Item Name":"T-Shirt","Item ID":"301","Brand":"FashionCo","Price":"19.99","Variation ID":"301-C","Variation Details":"Color: Green, Size: L"},{"Category":"Clothing","Item Name":"Jeans","Item ID":"302","Brand":"DenimWorks","Price":"49.99","Variation ID":"302-A","Variation Details":"Color: Dark Blue, Size: 32"},{"Category":"Clothing","Item Name":"Jeans","Item ID":"302","Brand":"DenimWorks","Price":"49.99","Variation ID":"302-B","Variation Details":"Color: Light Blue, Size: 34"},{"Category":"Books","Item Name":"Fiction Novel","Item ID":"401","Brand":"-","Price":"14.99","Variation ID":"401-A","Variation Details":"Format: Hardcover, Language: English"},{"Category":"Books","Item Name":"Fiction Novel","Item ID":"401","Brand":"-","Price":"14.99","Variation ID":"401-B","Variation Details":"Format: Paperback, Language: Spanish"},{"Category":"Books","Item Name":"Non-Fiction Guide","Item ID":"402","Brand":"-","Price":"24.99","Variation ID":"402-A","Variation Details":"Format: eBook, Language: English"},{"Category":"Books","Item Name":"Non-Fiction Guide","Item ID":"402","Brand":"-","Price":"24.99","Variation ID":"402-B","Variation Details":"Format: Paperback, Language: French"},{"Category":"Sports Equipment","Item Name":"Basketball","Item ID":"501","Brand":"SportsGear","Price":"29.99","Variation ID":"501-A","Variation Details":"Size: Size 7, Color: Orange"},{"Category":"Sports Equipment","Item Name":"Tennis Racket","Item ID":"502","Brand":"RacketPro","Price":"89.99","Variation ID":"502-A","Variation Details":"Material: Graphite, Color: Black"},{"Category":"Sports Equipment","Item Name":"Tennis Racket","Item ID":"502","Brand":"RacketPro","Price":"89.99","Variation ID":"502-B","Variation Details":"Material: Aluminum, Color: Silver"}]

write(toJSON(as.data.frame(cunymart)), "~/Downloads/cunymart.json")

HTML:

# Convert to HTML
html_content <- "<html><body><table border='1'><tr><th>Cateogry</th><th>Item Name</th><th>Item ID</th><th>Brand</th><th>Price</th><th>Variation ID</th><th>Variation Details</th>"
for (i in 1:nrow(cunymart)) {
  html_content <- paste0(html_content, "<tr>")
  for (j in 1:ncol(cunymart)) {
    html_content <- paste0(html_content, "<td>", cunymart[i, j], "</td>")
  }
  html_content <- paste0(html_content, "</tr>")
}
html_content <- paste0(html_content, "</html></body></table>")
print(html_content)

## [1] "<html><body><table border='1'><tr><th>Cateogry</th><th>Item Name</th><th>Item ID</th><th>Brand</th><th>Price</th><th>Variation ID</th><th>Variation Details</th><tr><td>Electronics</td><td>Smartphone</td><td>101</td><td>TechBrand</td><td>699.99</td><td>101-A</td><td>Color: Black, Storage: 64GB</td></tr><tr><td>Electronics</td><td>Smartphone</td><td>101</td><td>TechBrand</td><td>699.99</td><td>101-B</td><td>Color: White, Storage: 128GB</td></tr><tr><td>Electronics</td><td>Laptop</td><td>102</td><td>CompuBrand</td><td>1099.99</td><td>102-A</td><td>Color: Silver, Storage: 256GB</td></tr><tr><td>Electronics</td><td>Laptop</td><td>102</td><td>CompuBrand</td><td>1099.99</td><td>102-B</td><td>Color: Space Gray, Storage: 512GB</td></tr><tr><td>Home Appliances</td><td>Refrigerator</td><td>201</td><td>HomeCool</td><td>899.99</td><td>201-A</td><td>Color: Stainless Steel, Capacity: 20 cu ft</td></tr><tr><td>Home Appliances</td><td>Refrigerator</td><td>201</td><td>HomeCool</td><td>899.99</td><td>201-B</td><td>Color: White, Capacity: 18 cu ft</td></tr><tr><td>Home Appliances</td><td>Washing Machine</td><td>202</td><td>CleanTech</td><td>499.99</td><td>202-A</td><td>Type: Front Load, Capacity: 4.5 cu ft</td></tr><tr><td>Home Appliances</td><td>Washing Machine</td><td>202</td><td>CleanTech</td><td>499.99</td><td>202-B</td><td>Type: Top Load, Capacity: 5.0 cu ft</td></tr><tr><td>Clothing</td><td>T-Shirt</td><td>301</td><td>FashionCo</td><td>19.99</td><td>301-A</td><td>Color: Blue, Size: S</td></tr><tr><td>Clothing</td><td>T-Shirt</td><td>301</td><td>FashionCo</td><td>19.99</td><td>301-B</td><td>Color: Red, Size: M</td></tr><tr><td>Clothing</td><td>T-Shirt</td><td>301</td><td>FashionCo</td><td>19.99</td><td>301-C</td><td>Color: Green, Size: L</td></tr><tr><td>Clothing</td><td>Jeans</td><td>302</td><td>DenimWorks</td><td>49.99</td><td>302-A</td><td>Color: Dark Blue, Size: 32</td></tr><tr><td>Clothing</td><td>Jeans</td><td>302</td><td>DenimWorks</td><td>49.99</td><td>302-B</td><td>Color: Light Blue, Size: 34</td></tr><tr><td>Books</td><td>Fiction Novel</td><td>401</td><td>-</td><td>14.99</td><td>401-A</td><td>Format: Hardcover, Language: English</td></tr><tr><td>Books</td><td>Fiction Novel</td><td>401</td><td>-</td><td>14.99</td><td>401-B</td><td>Format: Paperback, Language: Spanish</td></tr><tr><td>Books</td><td>Non-Fiction Guide</td><td>402</td><td>-</td><td>24.99</td><td>402-A</td><td>Format: eBook, Language: English</td></tr><tr><td>Books</td><td>Non-Fiction Guide</td><td>402</td><td>-</td><td>24.99</td><td>402-B</td><td>Format: Paperback, Language: French</td></tr><tr><td>Sports Equipment</td><td>Basketball</td><td>501</td><td>SportsGear</td><td>29.99</td><td>501-A</td><td>Size: Size 7, Color: Orange</td></tr><tr><td>Sports Equipment</td><td>Tennis Racket</td><td>502</td><td>RacketPro</td><td>89.99</td><td>502-A</td><td>Material: Graphite, Color: Black</td></tr><tr><td>Sports Equipment</td><td>Tennis Racket</td><td>502</td><td>RacketPro</td><td>89.99</td><td>502-B</td><td>Material: Aluminum, Color: Silver</td></tr></html></body></table>"

writeLines(html_content, "~/Downloads/cunymart.html")

XML:

# Convert to XML
xml_content <- xml_new_document()
cunymart_node <- xml_add_child(xml_content, "cunymart")
for (i in 1:nrow(cunymart)) {
  single_cunymart_node <- xml_add_child(cunymart_node, "Item")
  xml_add_child(single_cunymart_node, "Category", cunymart[i, "Category"])
  xml_add_child(single_cunymart_node, "ItemName", cunymart[i, "Item Name"])
  xml_add_child(single_cunymart_node, "ItemID", cunymart[i, "Item ID"])
  xml_add_child(single_cunymart_node, "Brand", cunymart[i, "Brand"])
  xml_add_child(single_cunymart_node, "Price", cunymart[i, "Price"])
  xml_add_child(single_cunymart_node, "VariationID", cunymart[i, "Variation ID"])
  xml_add_child(single_cunymart_node, "VariationDetails", cunymart[i, "Variation Details"])
}
print(xml_content)

## {xml_document}
## <cunymart>
##  [1] <Item>\n  <Category>list(Category = "Electronics")</Category>\n  <ItemNa ...
##  [2] <Item>\n  <Category>list(Category = "Electronics")</Category>\n  <ItemNa ...
##  [3] <Item>\n  <Category>list(Category = "Electronics")</Category>\n  <ItemNa ...
##  [4] <Item>\n  <Category>list(Category = "Electronics")</Category>\n  <ItemNa ...
##  [5] <Item>\n  <Category>list(Category = "Home Appliances")</Category>\n  <It ...
##  [6] <Item>\n  <Category>list(Category = "Home Appliances")</Category>\n  <It ...
##  [7] <Item>\n  <Category>list(Category = "Home Appliances")</Category>\n  <It ...
##  [8] <Item>\n  <Category>list(Category = "Home Appliances")</Category>\n  <It ...
##  [9] <Item>\n  <Category>list(Category = "Clothing")</Category>\n  <ItemName> ...
## [10] <Item>\n  <Category>list(Category = "Clothing")</Category>\n  <ItemName> ...
## [11] <Item>\n  <Category>list(Category = "Clothing")</Category>\n  <ItemName> ...
## [12] <Item>\n  <Category>list(Category = "Clothing")</Category>\n  <ItemName> ...
## [13] <Item>\n  <Category>list(Category = "Clothing")</Category>\n  <ItemName> ...
## [14] <Item>\n  <Category>list(Category = "Books")</Category>\n  <ItemName>lis ...
## [15] <Item>\n  <Category>list(Category = "Books")</Category>\n  <ItemName>lis ...
## [16] <Item>\n  <Category>list(Category = "Books")</Category>\n  <ItemName>lis ...
## [17] <Item>\n  <Category>list(Category = "Books")</Category>\n  <ItemName>lis ...
## [18] <Item>\n  <Category>list(Category = "Sports Equipment")</Category>\n  <I ...
## [19] <Item>\n  <Category>list(Category = "Sports Equipment")</Category>\n  <I ...
## [20] <Item>\n  <Category>list(Category = "Sports Equipment")</Category>\n  <I ...

write_xml(xml_content, "~/Downloads/cunymart.xml")

Parquet:

# Convert to Parquet
pq_path <- "~/Downloads"
cunymart |>
  write_dataset(path = pq_path, format = "parquet")

Now let’s read the new file types.

Read New File Types

Read JSON:

# Import JSON
cunymart_json <- fromJSON("~/Downloads/cunymart.json")

head(cunymart_json, 5)

We now have a clean, tidy data frame.

Read HTML:

# Import HTML
cunymart_html <- read_html("~/Downloads/cunymart.html")
cunymart_html_df <- cunymart_html |>
  html_elements("table") |>
  html_table() |>
  as.data.frame()

head(cunymart_html_df, 5)

We now have a clean, tidy data frame.

Read XML:

When doing research about reading XML files online, I found the XML library so that is what I used to import the XML file:

# Import XML
cunymart_xml <- xmlParse("~/Downloads/cunymart.xml")

# Convert to data frame
xml_df <- xmlToDataFrame(nodes = getNodeSet(cunymart_xml, "//Item")) #Each node/observation is under "Item"
head(xml_df, 5)

We now have our tidy frame, but now each item is a list represented by a string. Let’s clean the data to make it easier to read:

# We only want the values inside the ""
xml_df_clean <- xml_df %>%
  mutate(
    across(
      .cols = everything(),
      .fns = ~ str_replace_all(
        str_extract(..1,'"(.*)"'),
        '\\"',
        '')
    )
  )

head(xml_df_clean, 5)

We now have a clean, tidy data frame.

Read parquet:

pq_path <- "~/Downloads/part-0.parquet"
cunymart_parquet_df <- open_dataset(pq_path)
head(cunymart_parquet_df, 5) |> collect()

We now have a clean, tidy data frame.

Conclusions

This showed the ease of importing and converting data into different file types in R. Here are the pros and cons I found for each file type:

JSON:

Pros:

Can be easily read and written by machines
Very common file type returned by web APIs
Human readable
Supports hierarchical data
Supported by JavaScript

Cons:

Numbers can sometimes be stored as strings
Can become verbose with nested structures
Queries can become complex

HTML:

Pros:

Has good structure
Using table makes it very easy for extracting data frames
Fast to download
Human readable
Supports hierarchical data

Cons:

rvest doesn’t run JavaScript
Scraping web data can be very complex and difficult
Legal and ethical considerations for web scraping
A lot of code needs to be written to create a data frame
Can become verbose with nested structures

XML:

Pros:

Human readable
Has good structure
Supports hierarchical data

Cons:

A lot of code needs to be written to create a data frame
Can become verbose with nested structures
Larger file size than CSV or JSON

Parquet:

Pros:

Smaller than CSVs - less data to move from disk to memory
Can store type with the data
“column-oriented” similar to a data frame
Parquet files are “chunked”

Cons:

Not human readable
Might not be the best for frequent updating of parquet files
Not the best choice if low latency is important

Week 7 Assignment

Gillian McGovern

2025-03-14

Overview

Load the Libraries

Read the Original Data

Data Frame Conversion

Read New File Types

Conclusions