For this week’s assignment, we experimented with different data types - specifically HTML, JSON, XML, and Parquet. I chose to build the dataframe from CUNYMart in R, and then convert my dataframe to the other datatypes to export. Then I listed some pros and cons of each data type.
Build CUNYMart dataframe in R
For this step I created vectors for Category, Item Name, Item ID, Brand, Price, Variation ID, and Variation Details from the information provide in the PDF. Then I combined all the vectors into one dataframe called dep_store.
cate <- rep(c("Electronics","Home"), each = 4)
cate2 <- rep(c("Clothing"), times = 5)
cate3 <- rep(c("Books"), times = 4)
cate4 <- rep(c("Sports Equipment"), times = 3)
category <- c(cate, cate2, cate3, cate4)
item_name <- c("Smartphone", "Smartphone", "Laptop", "Laptop", "Refrigerator", "Refrigerator", "Washing Machine", "Washing Machine", "T-Shirt", "T-Shirt", "T-Shirt", "Jeans", "Jeans", "Fiction Novel","Fiction Novel", "Non-Fiction Guide", "Non-Fiction Guide", "Basketball", "Tennis Racket", "Tennis Racket")
item_id <- c(101, 101, 102, 102, 201, 201, 202, 202, 301, 301, 301, 302, 302, 401, 401, 402, 402, 501, 502, 502)
brand <- c("TechBrand", "TechBrand", "CompuBrand", "CompuBrand", "HomeCool", "HomeCool", "CleanTech", "CleanTech", "FashionCo", "FashionCo", "FashionCo", "DenimWorks", "DenimWorks", "-", "-", "-", "-", "SportsGear", "RacketPro", "RacketPro")
price <- c("699.99", "699.99", "1099.99", "1099.99", "899.99", "899.99", "499.99", "499.99", "19.99", "19.99", "19.99", "49.99", "49.99", "14.99", "14.99", "24.99", "24.99", "29.99", "89.99", "89.99")
variation_id <- c("101-A", "101-A", "102-A", "102-A", "201-A", "201-B", "202-A", "202-B", "301-A", "301-B", "301-B", "302-A", "302-B", "401-A", "401-B", "402-A", "402-B", "501-A", "502-A", "502-B")
variation_details <- c("Color: Black, Storage: 64GB", "Color: White, Storage: 128GB", "Color: Silver, Storage: 256GB", "Color: Space Gray, Storage: 512GB", "Stainless Steel, Capacity: 20 cu ft", "Color: White, Capacity: 18 cu ft", "Type: Front Load, Capacity: 4.5 cu ft", "Type: Top Load, Capacity: 5.0 cu ft", "Color: Blue, Size: S", "Color: Red, Size: M", "Color: Green, Size: L", "Color: Dark Blue, Size: 32", "Color: Light Blue, Size: 34", "Format: Hardcover, Language: English", "Format: Paperback, Language: Spanish", "Format: eBook, Language: English", "Format: Paperback, Language: French", "Size: Size 7, Color: Orange", "Material: Graphite, Color: Black", "Material: Aluminum, Color: Silver")
dep_store <- data.frame(category, item_name, item_id , brand, price, variation_id , variation_details)
For HTML, I used knitr to change dep_store to an html format using kable, then exported the file in an HTML format.
library(knitr)
html_dep_store <- kable(dep_store, format = "html", table.attr = "class='table'")
html_dep_store
category | item_name | item_id | brand | price | variation_id | variation_details |
---|---|---|---|---|---|---|
Electronics | Smartphone | 101 | TechBrand | 699.99 | 101-A | Color: Black, Storage: 64GB |
Electronics | Smartphone | 101 | TechBrand | 699.99 | 101-A | Color: White, Storage: 128GB |
Electronics | Laptop | 102 | CompuBrand | 1099.99 | 102-A | Color: Silver, Storage: 256GB |
Electronics | Laptop | 102 | CompuBrand | 1099.99 | 102-A | Color: Space Gray, Storage: 512GB |
Home | Refrigerator | 201 | HomeCool | 899.99 | 201-A | Stainless Steel, Capacity: 20 cu ft |
Home | Refrigerator | 201 | HomeCool | 899.99 | 201-B | Color: White, Capacity: 18 cu ft |
Home | Washing Machine | 202 | CleanTech | 499.99 | 202-A | Type: Front Load, Capacity: 4.5 cu ft |
Home | Washing Machine | 202 | CleanTech | 499.99 | 202-B | Type: Top Load, Capacity: 5.0 cu ft |
Clothing | T-Shirt | 301 | FashionCo | 19.99 | 301-A | Color: Blue, Size: S |
Clothing | T-Shirt | 301 | FashionCo | 19.99 | 301-B | Color: Red, Size: M |
Clothing | T-Shirt | 301 | FashionCo | 19.99 | 301-B | Color: Green, Size: L |
Clothing | Jeans | 302 | DenimWorks | 49.99 | 302-A | Color: Dark Blue, Size: 32 |
Clothing | Jeans | 302 | DenimWorks | 49.99 | 302-B | Color: Light Blue, Size: 34 |
Books | Fiction Novel | 401 |
|
14.99 | 401-A | Format: Hardcover, Language: English |
Books | Fiction Novel | 401 |
|
14.99 | 401-B | Format: Paperback, Language: Spanish |
Books | Non-Fiction Guide | 402 |
|
24.99 | 402-A | Format: eBook, Language: English |
Books | Non-Fiction Guide | 402 |
|
24.99 | 402-B | Format: Paperback, Language: French |
Sports Equipment | Basketball | 501 | SportsGear | 29.99 | 501-A | Size: Size 7, Color: Orange |
Sports Equipment | Tennis Racket | 502 | RacketPro | 89.99 | 502-A | Material: Graphite, Color: Black |
Sports Equipment | Tennis Racket | 502 | RacketPro | 89.99 | 502-B | Material: Aluminum, Color: Silver |
cat(html_dep_store, file = "C:\\Users\\nakes\\OneDrive\\Desktop\\607\\data_dep_store.html")
Pros of HTML: It is easy to use and knit to in R, it is widely used on websites, and it is effective for presenting data. Cons of HTML: Simple errors can mess up output/websites, and complex code need for different formats/styles.
For JSON, I used jsonlite to change dep_store to a JSON format using toJSON, then exported the file in a JSON format.
library(jsonlite)
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
toJSON(x = dep_store, dataframe = 'rows', pretty = T)
## [
## {
## "category": "Electronics",
## "item_name": "Smartphone",
## "item_id": 101,
## "brand": "TechBrand",
## "price": "699.99",
## "variation_id": "101-A",
## "variation_details": "Color: Black, Storage: 64GB"
## },
## {
## "category": "Electronics",
## "item_name": "Smartphone",
## "item_id": 101,
## "brand": "TechBrand",
## "price": "699.99",
## "variation_id": "101-A",
## "variation_details": "Color: White, Storage: 128GB"
## },
## {
## "category": "Electronics",
## "item_name": "Laptop",
## "item_id": 102,
## "brand": "CompuBrand",
## "price": "1099.99",
## "variation_id": "102-A",
## "variation_details": "Color: Silver, Storage: 256GB"
## },
## {
## "category": "Electronics",
## "item_name": "Laptop",
## "item_id": 102,
## "brand": "CompuBrand",
## "price": "1099.99",
## "variation_id": "102-A",
## "variation_details": "Color: Space Gray, Storage: 512GB"
## },
## {
## "category": "Home",
## "item_name": "Refrigerator",
## "item_id": 201,
## "brand": "HomeCool",
## "price": "899.99",
## "variation_id": "201-A",
## "variation_details": "Stainless Steel, Capacity: 20 cu ft"
## },
## {
## "category": "Home",
## "item_name": "Refrigerator",
## "item_id": 201,
## "brand": "HomeCool",
## "price": "899.99",
## "variation_id": "201-B",
## "variation_details": "Color: White, Capacity: 18 cu ft"
## },
## {
## "category": "Home",
## "item_name": "Washing Machine",
## "item_id": 202,
## "brand": "CleanTech",
## "price": "499.99",
## "variation_id": "202-A",
## "variation_details": "Type: Front Load, Capacity: 4.5 cu ft"
## },
## {
## "category": "Home",
## "item_name": "Washing Machine",
## "item_id": 202,
## "brand": "CleanTech",
## "price": "499.99",
## "variation_id": "202-B",
## "variation_details": "Type: Top Load, Capacity: 5.0 cu ft"
## },
## {
## "category": "Clothing",
## "item_name": "T-Shirt",
## "item_id": 301,
## "brand": "FashionCo",
## "price": "19.99",
## "variation_id": "301-A",
## "variation_details": "Color: Blue, Size: S"
## },
## {
## "category": "Clothing",
## "item_name": "T-Shirt",
## "item_id": 301,
## "brand": "FashionCo",
## "price": "19.99",
## "variation_id": "301-B",
## "variation_details": "Color: Red, Size: M"
## },
## {
## "category": "Clothing",
## "item_name": "T-Shirt",
## "item_id": 301,
## "brand": "FashionCo",
## "price": "19.99",
## "variation_id": "301-B",
## "variation_details": "Color: Green, Size: L"
## },
## {
## "category": "Clothing",
## "item_name": "Jeans",
## "item_id": 302,
## "brand": "DenimWorks",
## "price": "49.99",
## "variation_id": "302-A",
## "variation_details": "Color: Dark Blue, Size: 32"
## },
## {
## "category": "Clothing",
## "item_name": "Jeans",
## "item_id": 302,
## "brand": "DenimWorks",
## "price": "49.99",
## "variation_id": "302-B",
## "variation_details": "Color: Light Blue, Size: 34"
## },
## {
## "category": "Books",
## "item_name": "Fiction Novel",
## "item_id": 401,
## "brand": "-",
## "price": "14.99",
## "variation_id": "401-A",
## "variation_details": "Format: Hardcover, Language: English"
## },
## {
## "category": "Books",
## "item_name": "Fiction Novel",
## "item_id": 401,
## "brand": "-",
## "price": "14.99",
## "variation_id": "401-B",
## "variation_details": "Format: Paperback, Language: Spanish"
## },
## {
## "category": "Books",
## "item_name": "Non-Fiction Guide",
## "item_id": 402,
## "brand": "-",
## "price": "24.99",
## "variation_id": "402-A",
## "variation_details": "Format: eBook, Language: English"
## },
## {
## "category": "Books",
## "item_name": "Non-Fiction Guide",
## "item_id": 402,
## "brand": "-",
## "price": "24.99",
## "variation_id": "402-B",
## "variation_details": "Format: Paperback, Language: French"
## },
## {
## "category": "Sports Equipment",
## "item_name": "Basketball",
## "item_id": 501,
## "brand": "SportsGear",
## "price": "29.99",
## "variation_id": "501-A",
## "variation_details": "Size: Size 7, Color: Orange"
## },
## {
## "category": "Sports Equipment",
## "item_name": "Tennis Racket",
## "item_id": 502,
## "brand": "RacketPro",
## "price": "89.99",
## "variation_id": "502-A",
## "variation_details": "Material: Graphite, Color: Black"
## },
## {
## "category": "Sports Equipment",
## "item_name": "Tennis Racket",
## "item_id": 502,
## "brand": "RacketPro",
## "price": "89.99",
## "variation_id": "502-B",
## "variation_details": "Material: Aluminum, Color: Silver"
## }
## ]
write_json(dep_store, "C:\\Users\\nakes\\OneDrive\\Desktop\\607\\data_dep_store.json", pretty = TRUE)
Pros of JSON: Easy to create and understand, readable for each observation, and commonly used for API calls. Cons of JSON: Not the best for large datasets or looking at variables.
For XML, I used xml, dplyr, and xml to change dep_store to a XML format using a loop and xmltree, then saved and exported the new xml data.
library(XML)
library(dplyr)
library(xml2)
xml_dep_store<- xmlTree("Department Store")
## Warning in xmlRoot.XMLInternalDocument(currentNodes[[1]]): empty XML document
for (i in 1:nrow(dep_store)) {
xml_dep_store$addNode(" ",
attrs = c(category = dep_store$category[i],
item_name = dep_store$item_name[i],
item_id = dep_store$item_id[i],
brand = dep_store$brand[i],
prive = dep_store$price[i],
variation_id = dep_store$variation_id[i],
variation_dtails = dep_store$variation_details[i]
))
}
cat(saveXML(xml_dep_store))
## <?xml version="1.0"?>
##
## <Department Store>
## < category="Electronics" item_name="Smartphone" item_id="101" brand="TechBrand" prive="699.99" variation_id="101-A" variation_dtails="Color: Black, Storage: 64GB"/>
## < category="Electronics" item_name="Smartphone" item_id="101" brand="TechBrand" prive="699.99" variation_id="101-A" variation_dtails="Color: White, Storage: 128GB"/>
## < category="Electronics" item_name="Laptop" item_id="102" brand="CompuBrand" prive="1099.99" variation_id="102-A" variation_dtails="Color: Silver, Storage: 256GB"/>
## < category="Electronics" item_name="Laptop" item_id="102" brand="CompuBrand" prive="1099.99" variation_id="102-A" variation_dtails="Color: Space Gray, Storage: 512GB"/>
## < category="Home" item_name="Refrigerator" item_id="201" brand="HomeCool" prive="899.99" variation_id="201-A" variation_dtails="Stainless Steel, Capacity: 20 cu ft"/>
## < category="Home" item_name="Refrigerator" item_id="201" brand="HomeCool" prive="899.99" variation_id="201-B" variation_dtails="Color: White, Capacity: 18 cu ft"/>
## < category="Home" item_name="Washing Machine" item_id="202" brand="CleanTech" prive="499.99" variation_id="202-A" variation_dtails="Type: Front Load, Capacity: 4.5 cu ft"/>
## < category="Home" item_name="Washing Machine" item_id="202" brand="CleanTech" prive="499.99" variation_id="202-B" variation_dtails="Type: Top Load, Capacity: 5.0 cu ft"/>
## < category="Clothing" item_name="T-Shirt" item_id="301" brand="FashionCo" prive="19.99" variation_id="301-A" variation_dtails="Color: Blue, Size: S"/>
## < category="Clothing" item_name="T-Shirt" item_id="301" brand="FashionCo" prive="19.99" variation_id="301-B" variation_dtails="Color: Red, Size: M"/>
## < category="Clothing" item_name="T-Shirt" item_id="301" brand="FashionCo" prive="19.99" variation_id="301-B" variation_dtails="Color: Green, Size: L"/>
## < category="Clothing" item_name="Jeans" item_id="302" brand="DenimWorks" prive="49.99" variation_id="302-A" variation_dtails="Color: Dark Blue, Size: 32"/>
## < category="Clothing" item_name="Jeans" item_id="302" brand="DenimWorks" prive="49.99" variation_id="302-B" variation_dtails="Color: Light Blue, Size: 34"/>
## < category="Books" item_name="Fiction Novel" item_id="401" brand="-" prive="14.99" variation_id="401-A" variation_dtails="Format: Hardcover, Language: English"/>
## < category="Books" item_name="Fiction Novel" item_id="401" brand="-" prive="14.99" variation_id="401-B" variation_dtails="Format: Paperback, Language: Spanish"/>
## < category="Books" item_name="Non-Fiction Guide" item_id="402" brand="-" prive="24.99" variation_id="402-A" variation_dtails="Format: eBook, Language: English"/>
## < category="Books" item_name="Non-Fiction Guide" item_id="402" brand="-" prive="24.99" variation_id="402-B" variation_dtails="Format: Paperback, Language: French"/>
## < category="Sports Equipment" item_name="Basketball" item_id="501" brand="SportsGear" prive="29.99" variation_id="501-A" variation_dtails="Size: Size 7, Color: Orange"/>
## < category="Sports Equipment" item_name="Tennis Racket" item_id="502" brand="RacketPro" prive="89.99" variation_id="502-A" variation_dtails="Material: Graphite, Color: Black"/>
## < category="Sports Equipment" item_name="Tennis Racket" item_id="502" brand="RacketPro" prive="89.99" variation_id="502-B" variation_dtails="Material: Aluminum, Color: Silver"/>
## </Department Store>
saveXML(xml_dep_store, file="C:\\Users\\nakes\\OneDrive\\Desktop\\607\\data_dep_store.xml")
## [1] "C:\\Users\\nakes\\OneDrive\\Desktop\\607\\data_dep_store.xml"
Pros of XML: Each to understand by observation and used as a tool for storing and transporting data. Cons of XML: Difficult to convert dataframe to xml in R and requires more than one library to create desired XML format in R.
For PARQUET, I exported the data into a parquet file using arrow and write_parquet.
library(arrow)
##
## Attaching package: 'arrow'
## The following object is masked from 'package:lubridate':
##
## duration
## The following object is masked from 'package:utils':
##
## timestamp
write_parquet(dep_store, "C:\\Users\\nakes\\OneDrive\\Desktop\\607\\dep_data.parquet")
Pros of PARQUET: Faster to read in data, easy to work with, and smaller in size compared to HTML. Cons of PARQUET: Difficult to understand raw file
Each data type: JSON, HTML, XML, and PARQUET is all useful for different purposes since they all have pros and cons. In the future, I would import the data in these different formats then try working with them in R.