Data 607 Assignment: working with JSON, HTML, XML, and Parquet in R You have received the following data from CUNYMart, located at 123 Example Street, Anytown, USA.
Category,Item Name,Item ID,Brand,Price,Variation ID,Variation Details Electronics,Smartphone,101,TechBrand,699.99,101-A,Color: Black, Storage: 64GB Electronics,Smartphone,101,TechBrand,699.99,101-B,Color: White, Storage: 128GB Electronics,Laptop,102,CompuBrand,1099.99,102-A,Color: Silver, Storage: 256GB Electronics,Laptop,102,CompuBrand,1099.99,102-B,Color: Space Gray, Storage: 512GB Home Appliances,Refrigerator,201,HomeCool,899.99,201-A,Color: Stainless Steel, Capacity: 20 cu ft Home Appliances,Refrigerator,201,HomeCool,899.99,201-B,Color: White, Capacity: 18 cu ft Home Appliances,Washing Machine,202,CleanTech,499.99,202-A,Type: Front Load, Capacity: 4.5 cu ft Home Appliances,Washing Machine,202,CleanTech,499.99,202-B,Type: Top Load, Capacity: 5.0 cu ft Clothing,T-Shirt,301,FashionCo,19.99,301-A,Color: Blue, Size: S Clothing,T-Shirt,301,FashionCo,19.99,301-B,Color: Red, Size: M Clothing,T-Shirt,301,FashionCo,19.99,301-C,Color: Green, Size: L Clothing,Jeans,302,DenimWorks,49.99,302-A,Color: Dark Blue, Size: 32 Clothing,Jeans,302,DenimWorks,49.99,302-B,Color: Light Blue, Size: 34 Books,Fiction Novel,401,-,14.99,401-A,Format: Hardcover, Language: English Books,Fiction Novel,401,-,14.99,401-B,Format: Paperback, Language: Spanish Books,Non-Fiction Guide,402,-,24.99,402-A,Format: eBook, Language: English Books,Non-Fiction Guide,402,-,24.99,402-B,Format: Paperback, Language: French Sports Equipment,Basketball,501,SportsGear,29.99,501-A,Size: Size 7, Color: Orange Sports Equipment,Tennis Racket,502,RacketPro,89.99,502-A,Material: Graphite, Color: Black Sports Equipment,Tennis Racket,502,RacketPro,89.99,502-B,Material: Aluminum, Color: Silver
This data will be used for inventory analysis at the retailer. You are required to prepare the data for analysis by formatting it in JSON, HTML, XML, and Parquet. Additionally, provide the pros and cons of each format. Your must include R code for generating and importing the data into R.
This is a resubmission attempt.
In this week’s assignment I will be preparing data for analysis in JSON, HTML, XML, and Parquet. I will take the given data, place it into a datatable and format it into JSON, HTML, XML, and Parquet.
library(jsonlite)
#Importing the data into a data table in R.
inventory <- data.frame(
Category = c("Electronics", "Electronics", "Electronics", "Electronics", "Home Appliances", "Home Appliances", "Home Appliances", "Home Appliances","Clothing", "Clothing", "Clothing", "Clothing", "Clothing","Books", "Books", "Books", "Books","Sports Equipment", "Sports Equipment", "Sports Equipment"),
`Item Name` = c("Smartphone", "Smartphone", "Laptop", "Laptop", "Refrigerator", "Refrigerator", "Washing Machine", "Washing Machine","T-Shirt", "T-Shirt", "T-Shirt", "Jeans", "Jeans","Fiction Novel", "Fiction Novel","Non-Fiction Guide", "Non-Fiction Guide", "Basketball", "Tennis Racket", "Tennis Racket"),
`Item ID` = c(101, 101, 102, 102, 201, 201, 202, 202, 301, 301, 301, 302, 302, 401, 401, 402, 402, 501, 502, 502),
Brand = c("TechBrand", "TechBrand", "CompuBrand", "CompuBrand", "HomeCool", "HomeCool", "CleanTech", "CleanTech", "FashionCo", "FashionCo", "FashionCo", "DenimWorks", "DenimWorks", "-", "-", "-", "-", "SportsGear", "RacketPro", "RacketPro"),
Price = c(699.99, 699.99, 1099.99, 1099.99, 899.99, 899.99, 499.99, 499.99, 19.99, 19.99, 19.99, 49.99, 49.99, 14.99, 14.99, 24.99, 24.99, 29.99, 89.99, 89.99),
`Variation ID` = c("101-A", "101-B", "102-A", "102-B", "201-A", "201-B", "202-A", "202-B", "301-A", "301-B", "301-C", "302-A", "302-B", "401-A", "401-B", "402-A", "402-B", "501-A", "502-A", "502-B"),
`Variation Details` = c("Color: Black, Storage: 64GB", "Color: White, Storage: 128GB", "Color: Silver, Storage: 256GB", "Color: Space Gray, Storage: 512GB", "Color: Stainless Steel, Capacity: 20 cu ft", "Color: White, Capacity: 18 cu ft", "Type: Front Load, Capacity: 4.5 cu ft", "Type: Top Load, Capacity: 5.0 cu ft", "Color: Blue, Size: S", "Color: Red, Size: M", "Color: Green, Size: L", "Color: Dark Blue, Size: 32", "Color: Light Blue, Size: 34", "Format: Hardcover, Language: English", "Format: Paperback, Language: Spanish", "Format: eBook, Language: English", "Format: Paperback, Language: French", "Size: Size 7, Color: Orange", "Material: Graphite, Color: Black", "Material: Aluminum, Color: Silver")
)
# Convert dataframe to JSON
json_data <- toJSON(inventory)
write(json_data, file = "data.json")
head(json_data)
## [{"Category":"Electronics","Item.Name":"Smartphone","Item.ID":101,"Brand":"TechBrand","Price":699.99,"Variation.ID":"101-A","Variation.Details":"Color: Black, Storage: 64GB"},{"Category":"Electronics","Item.Name":"Smartphone","Item.ID":101,"Brand":"TechBrand","Price":699.99,"Variation.ID":"101-B","Variation.Details":"Color: White, Storage: 128GB"},{"Category":"Electronics","Item.Name":"Laptop","Item.ID":102,"Brand":"CompuBrand","Price":1099.99,"Variation.ID":"102-A","Variation.Details":"Color: Silver, Storage: 256GB"},{"Category":"Electronics","Item.Name":"Laptop","Item.ID":102,"Brand":"CompuBrand","Price":1099.99,"Variation.ID":"102-B","Variation.Details":"Color: Space Gray, Storage: 512GB"},{"Category":"Home Appliances","Item.Name":"Refrigerator","Item.ID":201,"Brand":"HomeCool","Price":899.99,"Variation.ID":"201-A","Variation.Details":"Color: Stainless Steel, Capacity: 20 cu ft"},{"Category":"Home Appliances","Item.Name":"Refrigerator","Item.ID":201,"Brand":"HomeCool","Price":899.99,"Variation.ID":"201-B","Variation.Details":"Color: White, Capacity: 18 cu ft"},{"Category":"Home Appliances","Item.Name":"Washing Machine","Item.ID":202,"Brand":"CleanTech","Price":499.99,"Variation.ID":"202-A","Variation.Details":"Type: Front Load, Capacity: 4.5 cu ft"},{"Category":"Home Appliances","Item.Name":"Washing Machine","Item.ID":202,"Brand":"CleanTech","Price":499.99,"Variation.ID":"202-B","Variation.Details":"Type: Top Load, Capacity: 5.0 cu ft"},{"Category":"Clothing","Item.Name":"T-Shirt","Item.ID":301,"Brand":"FashionCo","Price":19.99,"Variation.ID":"301-A","Variation.Details":"Color: Blue, Size: S"},{"Category":"Clothing","Item.Name":"T-Shirt","Item.ID":301,"Brand":"FashionCo","Price":19.99,"Variation.ID":"301-B","Variation.Details":"Color: Red, Size: M"},{"Category":"Clothing","Item.Name":"T-Shirt","Item.ID":301,"Brand":"FashionCo","Price":19.99,"Variation.ID":"301-C","Variation.Details":"Color: Green, Size: L"},{"Category":"Clothing","Item.Name":"Jeans","Item.ID":302,"Brand":"DenimWorks","Price":49.99,"Variation.ID":"302-A","Variation.Details":"Color: Dark Blue, Size: 32"},{"Category":"Clothing","Item.Name":"Jeans","Item.ID":302,"Brand":"DenimWorks","Price":49.99,"Variation.ID":"302-B","Variation.Details":"Color: Light Blue, Size: 34"},{"Category":"Books","Item.Name":"Fiction Novel","Item.ID":401,"Brand":"-","Price":14.99,"Variation.ID":"401-A","Variation.Details":"Format: Hardcover, Language: English"},{"Category":"Books","Item.Name":"Fiction Novel","Item.ID":401,"Brand":"-","Price":14.99,"Variation.ID":"401-B","Variation.Details":"Format: Paperback, Language: Spanish"},{"Category":"Books","Item.Name":"Non-Fiction Guide","Item.ID":402,"Brand":"-","Price":24.99,"Variation.ID":"402-A","Variation.Details":"Format: eBook, Language: English"},{"Category":"Books","Item.Name":"Non-Fiction Guide","Item.ID":402,"Brand":"-","Price":24.99,"Variation.ID":"402-B","Variation.Details":"Format: Paperback, Language: French"},{"Category":"Sports Equipment","Item.Name":"Basketball","Item.ID":501,"Brand":"SportsGear","Price":29.99,"Variation.ID":"501-A","Variation.Details":"Size: Size 7, Color: Orange"},{"Category":"Sports Equipment","Item.Name":"Tennis Racket","Item.ID":502,"Brand":"RacketPro","Price":89.99,"Variation.ID":"502-A","Variation.Details":"Material: Graphite, Color: Black"},{"Category":"Sports Equipment","Item.Name":"Tennis Racket","Item.ID":502,"Brand":"RacketPro","Price":89.99,"Variation.ID":"502-B","Variation.Details":"Material: Aluminum, Color: Silver"}]
Converting the data table from Rstudio into a Json file placed the entire table into a single line of code in my output file. Copy of part of the Json file below:
[{“Category”:“Electronics”,“Item.Name”:“Smartphone”,“Item.ID”:101,
The Json file begins with a square bracket “[”, separates each row by curly brackets “{”, and a comma separates each value in the row. Then each value in the row is denoted by its column header followed by a “:” then the value, for example
{“Category”:“Electronics”,“Item.Name”:“Smartphone”
“{” shows the beginning of a row, and under the column header “Category” we have the value “Electronics”. Then the next value in the row is “smartphone” under the column header “Item.Name”
# Convert dataframe to HTML
library(xtable)
html_data <- xtable(inventory)
print(html_data, type = "html", file = "data.html")
library("xml2")
html_table <- read_html("https://raw.githubusercontent.com/Chung-Brandon/607/refs/heads/main/data.html.html")
html_table
## {html_document}
## <html>
## [1] <body><table border="1">\n<tr>\n<th> </th> <th> Category </th> <th> Item ...
Pulling the first 6 lines of HTML data:
Category | Item.Name | Item.ID | Brand | Price | Variation.ID | Variation.Details | |
---|---|---|---|---|---|---|---|
1 | Electronics | Smartphone | 101.00 | TechBrand | 699.99 | 101-A | Color: Black, Storage: 64GB |
2 | Electronics | Smartphone | 101.00 | TechBrand | 699.99 | 101-B | Color: White, Storage: 128GB |
3 | Electronics | Laptop | 102.00 | CompuBrand | 1099.99 | 102-A | Color: Silver, Storage: 256GB |
4 | Electronics | Laptop | 102.00 | CompuBrand | 1099.99 | 102-B | Color: Space Gray, Storage: 512GB |
. In general, html format displays tables in the same structure as an RStudio data table where each value in a row in Rstuido is shown in a corresponding row in html format.
A notable feature about html is that formatting is shown in an HTML file as well as the table data.appears in our dataset in the beginning of every row because of the right align that is used on the row number.
# Convert dataframe to XML
library(XML)
xml_data <- newXMLDoc()
root_node <- newXMLNode("inventory", doc = xml_data)
for (i in 1:nrow(inventory)) {
item_node <- newXMLNode("item", parent = root_node)
for (col in names(inventory)) {
newXMLNode(col, inventory[i, col], parent = item_node)
}
}
# Save the XML
saveXML(xml_data, file = "data.xml")
## [1] "data.xml"
print(xml_data)
## <?xml version="1.0"?>
## <inventory>
## <item>
## <Category>Electronics</Category>
## <Item.Name>Smartphone</Item.Name>
## <Item.ID>101</Item.ID>
## <Brand>TechBrand</Brand>
## <Price>699.99</Price>
## <Variation.ID>101-A</Variation.ID>
## <Variation.Details>Color: Black, Storage: 64GB</Variation.Details>
## </item>
## <item>
## <Category>Electronics</Category>
## <Item.Name>Smartphone</Item.Name>
## <Item.ID>101</Item.ID>
## <Brand>TechBrand</Brand>
## <Price>699.99</Price>
## <Variation.ID>101-B</Variation.ID>
## <Variation.Details>Color: White, Storage: 128GB</Variation.Details>
## </item>
## <item>
## <Category>Electronics</Category>
## <Item.Name>Laptop</Item.Name>
## <Item.ID>102</Item.ID>
## <Brand>CompuBrand</Brand>
## <Price>1099.99</Price>
## <Variation.ID>102-A</Variation.ID>
## <Variation.Details>Color: Silver, Storage: 256GB</Variation.Details>
## </item>
## <item>
## <Category>Electronics</Category>
## <Item.Name>Laptop</Item.Name>
## <Item.ID>102</Item.ID>
## <Brand>CompuBrand</Brand>
## <Price>1099.99</Price>
## <Variation.ID>102-B</Variation.ID>
## <Variation.Details>Color: Space Gray, Storage: 512GB</Variation.Details>
## </item>
## <item>
## <Category>Home Appliances</Category>
## <Item.Name>Refrigerator</Item.Name>
## <Item.ID>201</Item.ID>
## <Brand>HomeCool</Brand>
## <Price>899.99</Price>
## <Variation.ID>201-A</Variation.ID>
## <Variation.Details>Color: Stainless Steel, Capacity: 20 cu ft</Variation.Details>
## </item>
## <item>
## <Category>Home Appliances</Category>
## <Item.Name>Refrigerator</Item.Name>
## <Item.ID>201</Item.ID>
## <Brand>HomeCool</Brand>
## <Price>899.99</Price>
## <Variation.ID>201-B</Variation.ID>
## <Variation.Details>Color: White, Capacity: 18 cu ft</Variation.Details>
## </item>
## <item>
## <Category>Home Appliances</Category>
## <Item.Name>Washing Machine</Item.Name>
## <Item.ID>202</Item.ID>
## <Brand>CleanTech</Brand>
## <Price>499.99</Price>
## <Variation.ID>202-A</Variation.ID>
## <Variation.Details>Type: Front Load, Capacity: 4.5 cu ft</Variation.Details>
## </item>
## <item>
## <Category>Home Appliances</Category>
## <Item.Name>Washing Machine</Item.Name>
## <Item.ID>202</Item.ID>
## <Brand>CleanTech</Brand>
## <Price>499.99</Price>
## <Variation.ID>202-B</Variation.ID>
## <Variation.Details>Type: Top Load, Capacity: 5.0 cu ft</Variation.Details>
## </item>
## <item>
## <Category>Clothing</Category>
## <Item.Name>T-Shirt</Item.Name>
## <Item.ID>301</Item.ID>
## <Brand>FashionCo</Brand>
## <Price>19.99</Price>
## <Variation.ID>301-A</Variation.ID>
## <Variation.Details>Color: Blue, Size: S</Variation.Details>
## </item>
## <item>
## <Category>Clothing</Category>
## <Item.Name>T-Shirt</Item.Name>
## <Item.ID>301</Item.ID>
## <Brand>FashionCo</Brand>
## <Price>19.99</Price>
## <Variation.ID>301-B</Variation.ID>
## <Variation.Details>Color: Red, Size: M</Variation.Details>
## </item>
## <item>
## <Category>Clothing</Category>
## <Item.Name>T-Shirt</Item.Name>
## <Item.ID>301</Item.ID>
## <Brand>FashionCo</Brand>
## <Price>19.99</Price>
## <Variation.ID>301-C</Variation.ID>
## <Variation.Details>Color: Green, Size: L</Variation.Details>
## </item>
## <item>
## <Category>Clothing</Category>
## <Item.Name>Jeans</Item.Name>
## <Item.ID>302</Item.ID>
## <Brand>DenimWorks</Brand>
## <Price>49.99</Price>
## <Variation.ID>302-A</Variation.ID>
## <Variation.Details>Color: Dark Blue, Size: 32</Variation.Details>
## </item>
## <item>
## <Category>Clothing</Category>
## <Item.Name>Jeans</Item.Name>
## <Item.ID>302</Item.ID>
## <Brand>DenimWorks</Brand>
## <Price>49.99</Price>
## <Variation.ID>302-B</Variation.ID>
## <Variation.Details>Color: Light Blue, Size: 34</Variation.Details>
## </item>
## <item>
## <Category>Books</Category>
## <Item.Name>Fiction Novel</Item.Name>
## <Item.ID>401</Item.ID>
## <Brand>-</Brand>
## <Price>14.99</Price>
## <Variation.ID>401-A</Variation.ID>
## <Variation.Details>Format: Hardcover, Language: English</Variation.Details>
## </item>
## <item>
## <Category>Books</Category>
## <Item.Name>Fiction Novel</Item.Name>
## <Item.ID>401</Item.ID>
## <Brand>-</Brand>
## <Price>14.99</Price>
## <Variation.ID>401-B</Variation.ID>
## <Variation.Details>Format: Paperback, Language: Spanish</Variation.Details>
## </item>
## <item>
## <Category>Books</Category>
## <Item.Name>Non-Fiction Guide</Item.Name>
## <Item.ID>402</Item.ID>
## <Brand>-</Brand>
## <Price>24.99</Price>
## <Variation.ID>402-A</Variation.ID>
## <Variation.Details>Format: eBook, Language: English</Variation.Details>
## </item>
## <item>
## <Category>Books</Category>
## <Item.Name>Non-Fiction Guide</Item.Name>
## <Item.ID>402</Item.ID>
## <Brand>-</Brand>
## <Price>24.99</Price>
## <Variation.ID>402-B</Variation.ID>
## <Variation.Details>Format: Paperback, Language: French</Variation.Details>
## </item>
## <item>
## <Category>Sports Equipment</Category>
## <Item.Name>Basketball</Item.Name>
## <Item.ID>501</Item.ID>
## <Brand>SportsGear</Brand>
## <Price>29.99</Price>
## <Variation.ID>501-A</Variation.ID>
## <Variation.Details>Size: Size 7, Color: Orange</Variation.Details>
## </item>
## <item>
## <Category>Sports Equipment</Category>
## <Item.Name>Tennis Racket</Item.Name>
## <Item.ID>502</Item.ID>
## <Brand>RacketPro</Brand>
## <Price>89.99</Price>
## <Variation.ID>502-A</Variation.ID>
## <Variation.Details>Material: Graphite, Color: Black</Variation.Details>
## </item>
## <item>
## <Category>Sports Equipment</Category>
## <Item.Name>Tennis Racket</Item.Name>
## <Item.ID>502</Item.ID>
## <Brand>RacketPro</Brand>
## <Price>89.99</Price>
## <Variation.ID>502-B</Variation.ID>
## <Variation.Details>Material: Aluminum, Color: Silver</Variation.Details>
## </item>
## </inventory>
##
XML format uses the same <> to denote values of rows as html but the structure of the XML is not made to be similar to the table like html is. In XML each code chunk is a row of data from our Rstudio table. XML seems similar to html in the uses of <> and the fact that XML also includes value format information in its code.
library(arrow)
##
## Attaching package: 'arrow'
## The following object is masked from 'package:utils':
##
## timestamp
write_parquet(inventory, "data.parquet")
parquet_data <- read_parquet("C:\\Users\\chung\\Documents\\607\\data.parquet")
My parquet data can be found here:
https://github.com/Chung-Brandon/607/blob/main/data.parquet
Pulling out the first few lines from this file this is what we see:
PAR1–œL KðJ Electronics Home Appliances Clothing Books Sports Equipment “&,(6 (Sports EquipmentBooks @ ( $’¤m# &Æ5 Category(´¾&Ä&6 (Sports EquipmentBooks , ¨°L ”ð“ Smartphone Laptop
The Parquet data is not formatted in an intuitive way like the other file formats, and does not seem to have a structure that is not easily readable. The reading of this file requires a computer to decode it - especially when thinking about larger datasets.
Pros and Cons of each format: 1. JSON - PROS: Readable, widely used in web services CONS: Large file sizes.
HTML - PROS: Readable, Good for data display, easily viewable in web browsers. CONS: not good for analysis, storage of large data sets.
XML - PROS: Good for complex nested structures CONS: larger file size, slower parsing than JSON.
Parquet - Highly efficient storage and query performance, supported by big data tools. CONS: Not human readable