This assignment is for working with JSON, HTML, XML, and Parquet in R.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(repurrrsive)
library(jsonlite)
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:purrr':
##
## flatten
library(xml2)
library(XML)
library(rvest)
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:readr':
##
## guess_encoding
library(arrow)
##
## Attaching package: 'arrow'
##
## The following object is masked from 'package:lubridate':
##
## duration
##
## The following object is masked from 'package:utils':
##
## timestamp
I first pasted the text from the PDF into a CSV, and imported the
original data via the usual read_csv
:
df <- read_csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-607/refs/heads/main/assignment_7_text.csv")
## Rows: 20 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Category,Item Name,Item ID,Brand,Price,Variation ID,Variation Details
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df, 5)
# Separate into columns
cunymart <- df %>%
separate_wider_delim(`Category,Item Name,Item ID,Brand,Price,Variation ID,Variation Details`, delim = ",", names = c("Category", "Item Name", "Item ID", "Brand", "Price", "Variation ID", "Variation Details"), too_many = "merge")
head(cunymart, 5)
This data frame now matches the original data in the PDF. This data frame is tidy since each row represents a Variation ID for a particular item. If any new item is created, then a new row can be added. Each feature of the item is now a column.
I would probably separate the Variation Details
into 2
separate columns to make it easier to read, but for the purpose of this
assignment I won’t alter the original data.
Now let’s convert the data frame into the different file types.
For each file type, I save the new converted file to the
Downloads
folder of the person running the markdown file.
Note: I then use this file type directly for the importing section later
on.
I go into the pros and cons of each file type in the
Conclusions
section.
Let’s convert the data frame:
JSON:
# Convert to JSON
print(toJSON(as.data.frame(cunymart)))
## [{"Category":"Electronics","Item Name":"Smartphone","Item ID":"101","Brand":"TechBrand","Price":"699.99","Variation ID":"101-A","Variation Details":"Color: Black, Storage: 64GB"},{"Category":"Electronics","Item Name":"Smartphone","Item ID":"101","Brand":"TechBrand","Price":"699.99","Variation ID":"101-B","Variation Details":"Color: White, Storage: 128GB"},{"Category":"Electronics","Item Name":"Laptop","Item ID":"102","Brand":"CompuBrand","Price":"1099.99","Variation ID":"102-A","Variation Details":"Color: Silver, Storage: 256GB"},{"Category":"Electronics","Item Name":"Laptop","Item ID":"102","Brand":"CompuBrand","Price":"1099.99","Variation ID":"102-B","Variation Details":"Color: Space Gray, Storage: 512GB"},{"Category":"Home Appliances","Item Name":"Refrigerator","Item ID":"201","Brand":"HomeCool","Price":"899.99","Variation ID":"201-A","Variation Details":"Color: Stainless Steel, Capacity: 20 cu ft"},{"Category":"Home Appliances","Item Name":"Refrigerator","Item ID":"201","Brand":"HomeCool","Price":"899.99","Variation ID":"201-B","Variation Details":"Color: White, Capacity: 18 cu ft"},{"Category":"Home Appliances","Item Name":"Washing Machine","Item ID":"202","Brand":"CleanTech","Price":"499.99","Variation ID":"202-A","Variation Details":"Type: Front Load, Capacity: 4.5 cu ft"},{"Category":"Home Appliances","Item Name":"Washing Machine","Item ID":"202","Brand":"CleanTech","Price":"499.99","Variation ID":"202-B","Variation Details":"Type: Top Load, Capacity: 5.0 cu ft"},{"Category":"Clothing","Item Name":"T-Shirt","Item ID":"301","Brand":"FashionCo","Price":"19.99","Variation ID":"301-A","Variation Details":"Color: Blue, Size: S"},{"Category":"Clothing","Item Name":"T-Shirt","Item ID":"301","Brand":"FashionCo","Price":"19.99","Variation ID":"301-B","Variation Details":"Color: Red, Size: M"},{"Category":"Clothing","Item Name":"T-Shirt","Item ID":"301","Brand":"FashionCo","Price":"19.99","Variation ID":"301-C","Variation Details":"Color: Green, Size: L"},{"Category":"Clothing","Item Name":"Jeans","Item ID":"302","Brand":"DenimWorks","Price":"49.99","Variation ID":"302-A","Variation Details":"Color: Dark Blue, Size: 32"},{"Category":"Clothing","Item Name":"Jeans","Item ID":"302","Brand":"DenimWorks","Price":"49.99","Variation ID":"302-B","Variation Details":"Color: Light Blue, Size: 34"},{"Category":"Books","Item Name":"Fiction Novel","Item ID":"401","Brand":"-","Price":"14.99","Variation ID":"401-A","Variation Details":"Format: Hardcover, Language: English"},{"Category":"Books","Item Name":"Fiction Novel","Item ID":"401","Brand":"-","Price":"14.99","Variation ID":"401-B","Variation Details":"Format: Paperback, Language: Spanish"},{"Category":"Books","Item Name":"Non-Fiction Guide","Item ID":"402","Brand":"-","Price":"24.99","Variation ID":"402-A","Variation Details":"Format: eBook, Language: English"},{"Category":"Books","Item Name":"Non-Fiction Guide","Item ID":"402","Brand":"-","Price":"24.99","Variation ID":"402-B","Variation Details":"Format: Paperback, Language: French"},{"Category":"Sports Equipment","Item Name":"Basketball","Item ID":"501","Brand":"SportsGear","Price":"29.99","Variation ID":"501-A","Variation Details":"Size: Size 7, Color: Orange"},{"Category":"Sports Equipment","Item Name":"Tennis Racket","Item ID":"502","Brand":"RacketPro","Price":"89.99","Variation ID":"502-A","Variation Details":"Material: Graphite, Color: Black"},{"Category":"Sports Equipment","Item Name":"Tennis Racket","Item ID":"502","Brand":"RacketPro","Price":"89.99","Variation ID":"502-B","Variation Details":"Material: Aluminum, Color: Silver"}]
write(toJSON(as.data.frame(cunymart)), "~/Downloads/cunymart.json")
HTML:
# Convert to HTML
html_content <- "<html><body><table border='1'><tr><th>Cateogry</th><th>Item Name</th><th>Item ID</th><th>Brand</th><th>Price</th><th>Variation ID</th><th>Variation Details</th>"
for (i in 1:nrow(cunymart)) {
html_content <- paste0(html_content, "<tr>")
for (j in 1:ncol(cunymart)) {
html_content <- paste0(html_content, "<td>", cunymart[i, j], "</td>")
}
html_content <- paste0(html_content, "</tr>")
}
html_content <- paste0(html_content, "</html></body></table>")
print(html_content)
## [1] "<html><body><table border='1'><tr><th>Cateogry</th><th>Item Name</th><th>Item ID</th><th>Brand</th><th>Price</th><th>Variation ID</th><th>Variation Details</th><tr><td>Electronics</td><td>Smartphone</td><td>101</td><td>TechBrand</td><td>699.99</td><td>101-A</td><td>Color: Black, Storage: 64GB</td></tr><tr><td>Electronics</td><td>Smartphone</td><td>101</td><td>TechBrand</td><td>699.99</td><td>101-B</td><td>Color: White, Storage: 128GB</td></tr><tr><td>Electronics</td><td>Laptop</td><td>102</td><td>CompuBrand</td><td>1099.99</td><td>102-A</td><td>Color: Silver, Storage: 256GB</td></tr><tr><td>Electronics</td><td>Laptop</td><td>102</td><td>CompuBrand</td><td>1099.99</td><td>102-B</td><td>Color: Space Gray, Storage: 512GB</td></tr><tr><td>Home Appliances</td><td>Refrigerator</td><td>201</td><td>HomeCool</td><td>899.99</td><td>201-A</td><td>Color: Stainless Steel, Capacity: 20 cu ft</td></tr><tr><td>Home Appliances</td><td>Refrigerator</td><td>201</td><td>HomeCool</td><td>899.99</td><td>201-B</td><td>Color: White, Capacity: 18 cu ft</td></tr><tr><td>Home Appliances</td><td>Washing Machine</td><td>202</td><td>CleanTech</td><td>499.99</td><td>202-A</td><td>Type: Front Load, Capacity: 4.5 cu ft</td></tr><tr><td>Home Appliances</td><td>Washing Machine</td><td>202</td><td>CleanTech</td><td>499.99</td><td>202-B</td><td>Type: Top Load, Capacity: 5.0 cu ft</td></tr><tr><td>Clothing</td><td>T-Shirt</td><td>301</td><td>FashionCo</td><td>19.99</td><td>301-A</td><td>Color: Blue, Size: S</td></tr><tr><td>Clothing</td><td>T-Shirt</td><td>301</td><td>FashionCo</td><td>19.99</td><td>301-B</td><td>Color: Red, Size: M</td></tr><tr><td>Clothing</td><td>T-Shirt</td><td>301</td><td>FashionCo</td><td>19.99</td><td>301-C</td><td>Color: Green, Size: L</td></tr><tr><td>Clothing</td><td>Jeans</td><td>302</td><td>DenimWorks</td><td>49.99</td><td>302-A</td><td>Color: Dark Blue, Size: 32</td></tr><tr><td>Clothing</td><td>Jeans</td><td>302</td><td>DenimWorks</td><td>49.99</td><td>302-B</td><td>Color: Light Blue, Size: 34</td></tr><tr><td>Books</td><td>Fiction Novel</td><td>401</td><td>-</td><td>14.99</td><td>401-A</td><td>Format: Hardcover, Language: English</td></tr><tr><td>Books</td><td>Fiction Novel</td><td>401</td><td>-</td><td>14.99</td><td>401-B</td><td>Format: Paperback, Language: Spanish</td></tr><tr><td>Books</td><td>Non-Fiction Guide</td><td>402</td><td>-</td><td>24.99</td><td>402-A</td><td>Format: eBook, Language: English</td></tr><tr><td>Books</td><td>Non-Fiction Guide</td><td>402</td><td>-</td><td>24.99</td><td>402-B</td><td>Format: Paperback, Language: French</td></tr><tr><td>Sports Equipment</td><td>Basketball</td><td>501</td><td>SportsGear</td><td>29.99</td><td>501-A</td><td>Size: Size 7, Color: Orange</td></tr><tr><td>Sports Equipment</td><td>Tennis Racket</td><td>502</td><td>RacketPro</td><td>89.99</td><td>502-A</td><td>Material: Graphite, Color: Black</td></tr><tr><td>Sports Equipment</td><td>Tennis Racket</td><td>502</td><td>RacketPro</td><td>89.99</td><td>502-B</td><td>Material: Aluminum, Color: Silver</td></tr></html></body></table>"
writeLines(html_content, "~/Downloads/cunymart.html")
XML:
# Convert to XML
xml_content <- xml_new_document()
cunymart_node <- xml_add_child(xml_content, "cunymart")
for (i in 1:nrow(cunymart)) {
single_cunymart_node <- xml_add_child(cunymart_node, "Item")
xml_add_child(single_cunymart_node, "Category", cunymart[i, "Category"])
xml_add_child(single_cunymart_node, "ItemName", cunymart[i, "Item Name"])
xml_add_child(single_cunymart_node, "ItemID", cunymart[i, "Item ID"])
xml_add_child(single_cunymart_node, "Brand", cunymart[i, "Brand"])
xml_add_child(single_cunymart_node, "Price", cunymart[i, "Price"])
xml_add_child(single_cunymart_node, "VariationID", cunymart[i, "Variation ID"])
xml_add_child(single_cunymart_node, "VariationDetails", cunymart[i, "Variation Details"])
}
print(xml_content)
## {xml_document}
## <cunymart>
## [1] <Item>\n <Category>list(Category = "Electronics")</Category>\n <ItemNa ...
## [2] <Item>\n <Category>list(Category = "Electronics")</Category>\n <ItemNa ...
## [3] <Item>\n <Category>list(Category = "Electronics")</Category>\n <ItemNa ...
## [4] <Item>\n <Category>list(Category = "Electronics")</Category>\n <ItemNa ...
## [5] <Item>\n <Category>list(Category = "Home Appliances")</Category>\n <It ...
## [6] <Item>\n <Category>list(Category = "Home Appliances")</Category>\n <It ...
## [7] <Item>\n <Category>list(Category = "Home Appliances")</Category>\n <It ...
## [8] <Item>\n <Category>list(Category = "Home Appliances")</Category>\n <It ...
## [9] <Item>\n <Category>list(Category = "Clothing")</Category>\n <ItemName> ...
## [10] <Item>\n <Category>list(Category = "Clothing")</Category>\n <ItemName> ...
## [11] <Item>\n <Category>list(Category = "Clothing")</Category>\n <ItemName> ...
## [12] <Item>\n <Category>list(Category = "Clothing")</Category>\n <ItemName> ...
## [13] <Item>\n <Category>list(Category = "Clothing")</Category>\n <ItemName> ...
## [14] <Item>\n <Category>list(Category = "Books")</Category>\n <ItemName>lis ...
## [15] <Item>\n <Category>list(Category = "Books")</Category>\n <ItemName>lis ...
## [16] <Item>\n <Category>list(Category = "Books")</Category>\n <ItemName>lis ...
## [17] <Item>\n <Category>list(Category = "Books")</Category>\n <ItemName>lis ...
## [18] <Item>\n <Category>list(Category = "Sports Equipment")</Category>\n <I ...
## [19] <Item>\n <Category>list(Category = "Sports Equipment")</Category>\n <I ...
## [20] <Item>\n <Category>list(Category = "Sports Equipment")</Category>\n <I ...
write_xml(xml_content, "~/Downloads/cunymart.xml")
Parquet:
# Convert to Parquet
pq_path <- "~/Downloads"
cunymart |>
write_dataset(path = pq_path, format = "parquet")
Now let’s read the new file types.
Read JSON:
# Import JSON
cunymart_json <- fromJSON("~/Downloads/cunymart.json")
head(cunymart_json, 5)
We now have a clean, tidy data frame.
Read HTML:
# Import HTML
cunymart_html <- read_html("~/Downloads/cunymart.html")
cunymart_html_df <- cunymart_html |>
html_elements("table") |>
html_table() |>
as.data.frame()
head(cunymart_html_df, 5)
We now have a clean, tidy data frame.
Read XML:
When doing research about reading XML files online, I found the
XML
library so that is what I used to import the XML
file:
# Import XML
cunymart_xml <- xmlParse("~/Downloads/cunymart.xml")
# Convert to data frame
xml_df <- xmlToDataFrame(nodes = getNodeSet(cunymart_xml, "//Item")) #Each node/observation is under "Item"
head(xml_df, 5)
We now have our tidy frame, but now each item is a list represented by a string. Let’s clean the data to make it easier to read:
# We only want the values inside the ""
xml_df_clean <- xml_df %>%
mutate(
across(
.cols = everything(),
.fns = ~ str_replace_all(
str_extract(..1,'"(.*)"'),
'\\"',
'')
)
)
head(xml_df_clean, 5)
We now have a clean, tidy data frame.
Read parquet:
pq_path <- "~/Downloads/part-0.parquet"
cunymart_parquet_df <- open_dataset(pq_path)
head(cunymart_parquet_df, 5) |> collect()
We now have a clean, tidy data frame.
This showed the ease of importing and converting data into different file types in R. Here are the pros and cons I found for each file type:
JSON:
Pros:
Cons:
HTML:
Pros:
table
makes it very easy for extracting data
framesCons:
XML:
Pros:
Cons:
Parquet:
Pros:
Cons: