Data 607: Assignment 7

Introduction

For this week’s assignment, we experimented with different data types - specifically HTML, JSON, XML, and Parquet. I chose to build the dataframe from CUNYMart in R, and then convert my dataframe to the other datatypes to export. Then I listed some pros and cons of each data type.

Build CUNYMart dataframe in R

For this step I created vectors for Category, Item Name, Item ID, Brand, Price, Variation ID, and Variation Details from the information provide in the PDF. Then I combined all the vectors into one dataframe called dep_store.

cate <- rep(c("Electronics","Home"), each = 4)
cate2 <- rep(c("Clothing"), times = 5)
cate3 <- rep(c("Books"), times = 4)
cate4 <- rep(c("Sports Equipment"), times = 3)
category <- c(cate, cate2, cate3, cate4)

item_name <- c("Smartphone", "Smartphone", "Laptop", "Laptop", "Refrigerator", "Refrigerator", "Washing Machine", "Washing Machine", "T-Shirt", "T-Shirt", "T-Shirt", "Jeans", "Jeans", "Fiction Novel","Fiction Novel", "Non-Fiction Guide", "Non-Fiction Guide", "Basketball", "Tennis Racket", "Tennis Racket")

item_id <- c(101, 101, 102, 102, 201, 201, 202, 202, 301, 301, 301, 302, 302, 401, 401, 402, 402, 501, 502, 502)

brand <- c("TechBrand", "TechBrand", "CompuBrand", "CompuBrand", "HomeCool", "HomeCool", "CleanTech", "CleanTech", "FashionCo", "FashionCo", "FashionCo", "DenimWorks", "DenimWorks", "-", "-", "-", "-", "SportsGear", "RacketPro", "RacketPro")

price <- c("699.99", "699.99", "1099.99", "1099.99", "899.99", "899.99", "499.99", "499.99", "19.99", "19.99", "19.99", "49.99", "49.99", "14.99", "14.99", "24.99", "24.99", "29.99", "89.99", "89.99")

variation_id <- c("101-A", "101-A", "102-A", "102-A", "201-A", "201-B", "202-A", "202-B", "301-A", "301-B", "301-B", "302-A", "302-B", "401-A", "401-B", "402-A", "402-B", "501-A", "502-A", "502-B")

variation_details <- c("Color: Black, Storage: 64GB", "Color: White, Storage: 128GB", "Color: Silver, Storage: 256GB", "Color: Space Gray, Storage: 512GB", "Stainless Steel, Capacity: 20 cu ft", "Color: White, Capacity: 18 cu ft", "Type: Front Load, Capacity: 4.5 cu ft", "Type: Top Load, Capacity: 5.0 cu ft", "Color: Blue, Size: S", "Color: Red, Size: M", "Color: Green, Size: L", "Color: Dark Blue, Size: 32", "Color: Light Blue, Size: 34", "Format: Hardcover, Language: English", "Format: Paperback, Language: Spanish", "Format: eBook, Language: English", "Format: Paperback, Language: French", "Size: Size 7, Color: Orange", "Material: Graphite, Color: Black", "Material: Aluminum, Color: Silver")

dep_store <- data.frame(category, item_name, item_id , brand, price, variation_id , variation_details)

HTML

For HTML, I used knitr to change dep_store to an html format using kable, then exported the file in an HTML format.

library(knitr)


html_dep_store <- kable(dep_store, format = "html", table.attr = "class='table'")

html_dep_store

category	item_name	item_id	brand	price	variation_id	variation_details
Electronics	Smartphone	101	TechBrand	699.99	101-A	Color: Black, Storage: 64GB
Electronics	Smartphone	101	TechBrand	699.99	101-A	Color: White, Storage: 128GB
Electronics	Laptop	102	CompuBrand	1099.99	102-A	Color: Silver, Storage: 256GB
Electronics	Laptop	102	CompuBrand	1099.99	102-A	Color: Space Gray, Storage: 512GB
Home	Refrigerator	201	HomeCool	899.99	201-A	Stainless Steel, Capacity: 20 cu ft
Home	Refrigerator	201	HomeCool	899.99	201-B	Color: White, Capacity: 18 cu ft
Home	Washing Machine	202	CleanTech	499.99	202-A	Type: Front Load, Capacity: 4.5 cu ft
Home	Washing Machine	202	CleanTech	499.99	202-B	Type: Top Load, Capacity: 5.0 cu ft
Clothing	T-Shirt	301	FashionCo	19.99	301-A	Color: Blue, Size: S
Clothing	T-Shirt	301	FashionCo	19.99	301-B	Color: Red, Size: M
Clothing	T-Shirt	301	FashionCo	19.99	301-B	Color: Green, Size: L
Clothing	Jeans	302	DenimWorks	49.99	302-A	Color: Dark Blue, Size: 32
Clothing	Jeans	302	DenimWorks	49.99	302-B	Color: Light Blue, Size: 34
Books	Fiction Novel	401		14.99	401-A	Format: Hardcover, Language: English
Books	Fiction Novel	401		14.99	401-B	Format: Paperback, Language: Spanish
Books	Non-Fiction Guide	402		24.99	402-A	Format: eBook, Language: English
Books	Non-Fiction Guide	402		24.99	402-B	Format: Paperback, Language: French
Sports Equipment	Basketball	501	SportsGear	29.99	501-A	Size: Size 7, Color: Orange
Sports Equipment	Tennis Racket	502	RacketPro	89.99	502-A	Material: Graphite, Color: Black
Sports Equipment	Tennis Racket	502	RacketPro	89.99	502-B	Material: Aluminum, Color: Silver

cat(html_dep_store, file = "C:\\Users\\nakes\\OneDrive\\Desktop\\607\\data_dep_store.html")

Pros of HTML: It is easy to use and knit to in R, it is widely used on websites, and it is effective for presenting data. Cons of HTML: Simple errors can mess up output/websites, and complex code need for different formats/styles.

JSON

For JSON, I used jsonlite to change dep_store to a JSON format using toJSON, then exported the file in a JSON format.

library(jsonlite)

## 
## Attaching package: 'jsonlite'

## The following object is masked from 'package:purrr':
## 
##     flatten

toJSON(x = dep_store, dataframe = 'rows', pretty = T)

## [
##   {
##     "category": "Electronics",
##     "item_name": "Smartphone",
##     "item_id": 101,
##     "brand": "TechBrand",
##     "price": "699.99",
##     "variation_id": "101-A",
##     "variation_details": "Color: Black, Storage: 64GB"
##   },
##   {
##     "category": "Electronics",
##     "item_name": "Smartphone",
##     "item_id": 101,
##     "brand": "TechBrand",
##     "price": "699.99",
##     "variation_id": "101-A",
##     "variation_details": "Color: White, Storage: 128GB"
##   },
##   {
##     "category": "Electronics",
##     "item_name": "Laptop",
##     "item_id": 102,
##     "brand": "CompuBrand",
##     "price": "1099.99",
##     "variation_id": "102-A",
##     "variation_details": "Color: Silver, Storage: 256GB"
##   },
##   {
##     "category": "Electronics",
##     "item_name": "Laptop",
##     "item_id": 102,
##     "brand": "CompuBrand",
##     "price": "1099.99",
##     "variation_id": "102-A",
##     "variation_details": "Color: Space Gray, Storage: 512GB"
##   },
##   {
##     "category": "Home",
##     "item_name": "Refrigerator",
##     "item_id": 201,
##     "brand": "HomeCool",
##     "price": "899.99",
##     "variation_id": "201-A",
##     "variation_details": "Stainless Steel, Capacity: 20 cu ft"
##   },
##   {
##     "category": "Home",
##     "item_name": "Refrigerator",
##     "item_id": 201,
##     "brand": "HomeCool",
##     "price": "899.99",
##     "variation_id": "201-B",
##     "variation_details": "Color: White, Capacity: 18 cu ft"
##   },
##   {
##     "category": "Home",
##     "item_name": "Washing Machine",
##     "item_id": 202,
##     "brand": "CleanTech",
##     "price": "499.99",
##     "variation_id": "202-A",
##     "variation_details": "Type: Front Load, Capacity: 4.5 cu ft"
##   },
##   {
##     "category": "Home",
##     "item_name": "Washing Machine",
##     "item_id": 202,
##     "brand": "CleanTech",
##     "price": "499.99",
##     "variation_id": "202-B",
##     "variation_details": "Type: Top Load, Capacity: 5.0 cu ft"
##   },
##   {
##     "category": "Clothing",
##     "item_name": "T-Shirt",
##     "item_id": 301,
##     "brand": "FashionCo",
##     "price": "19.99",
##     "variation_id": "301-A",
##     "variation_details": "Color: Blue, Size: S"
##   },
##   {
##     "category": "Clothing",
##     "item_name": "T-Shirt",
##     "item_id": 301,
##     "brand": "FashionCo",
##     "price": "19.99",
##     "variation_id": "301-B",
##     "variation_details": "Color: Red, Size: M"
##   },
##   {
##     "category": "Clothing",
##     "item_name": "T-Shirt",
##     "item_id": 301,
##     "brand": "FashionCo",
##     "price": "19.99",
##     "variation_id": "301-B",
##     "variation_details": "Color: Green, Size: L"
##   },
##   {
##     "category": "Clothing",
##     "item_name": "Jeans",
##     "item_id": 302,
##     "brand": "DenimWorks",
##     "price": "49.99",
##     "variation_id": "302-A",
##     "variation_details": "Color: Dark Blue, Size: 32"
##   },
##   {
##     "category": "Clothing",
##     "item_name": "Jeans",
##     "item_id": 302,
##     "brand": "DenimWorks",
##     "price": "49.99",
##     "variation_id": "302-B",
##     "variation_details": "Color: Light Blue, Size: 34"
##   },
##   {
##     "category": "Books",
##     "item_name": "Fiction Novel",
##     "item_id": 401,
##     "brand": "-",
##     "price": "14.99",
##     "variation_id": "401-A",
##     "variation_details": "Format: Hardcover, Language: English"
##   },
##   {
##     "category": "Books",
##     "item_name": "Fiction Novel",
##     "item_id": 401,
##     "brand": "-",
##     "price": "14.99",
##     "variation_id": "401-B",
##     "variation_details": "Format: Paperback, Language: Spanish"
##   },
##   {
##     "category": "Books",
##     "item_name": "Non-Fiction Guide",
##     "item_id": 402,
##     "brand": "-",
##     "price": "24.99",
##     "variation_id": "402-A",
##     "variation_details": "Format: eBook, Language: English"
##   },
##   {
##     "category": "Books",
##     "item_name": "Non-Fiction Guide",
##     "item_id": 402,
##     "brand": "-",
##     "price": "24.99",
##     "variation_id": "402-B",
##     "variation_details": "Format: Paperback, Language: French"
##   },
##   {
##     "category": "Sports Equipment",
##     "item_name": "Basketball",
##     "item_id": 501,
##     "brand": "SportsGear",
##     "price": "29.99",
##     "variation_id": "501-A",
##     "variation_details": "Size: Size 7, Color: Orange"
##   },
##   {
##     "category": "Sports Equipment",
##     "item_name": "Tennis Racket",
##     "item_id": 502,
##     "brand": "RacketPro",
##     "price": "89.99",
##     "variation_id": "502-A",
##     "variation_details": "Material: Graphite, Color: Black"
##   },
##   {
##     "category": "Sports Equipment",
##     "item_name": "Tennis Racket",
##     "item_id": 502,
##     "brand": "RacketPro",
##     "price": "89.99",
##     "variation_id": "502-B",
##     "variation_details": "Material: Aluminum, Color: Silver"
##   }
## ]

write_json(dep_store, "C:\\Users\\nakes\\OneDrive\\Desktop\\607\\data_dep_store.json", pretty = TRUE)

Pros of JSON: Easy to create and understand, readable for each observation, and commonly used for API calls. Cons of JSON: Not the best for large datasets or looking at variables.

XML

For XML, I used xml, dplyr, and xml to change dep_store to a XML format using a loop and xmltree, then saved and exported the new xml data.

library(XML)
library(dplyr)
library(xml2)


xml_dep_store<- xmlTree("Department Store")

## Warning in xmlRoot.XMLInternalDocument(currentNodes[[1]]): empty XML document

for (i in 1:nrow(dep_store)) {
  xml_dep_store$addNode(" ",
                   attrs = c(category = dep_store$category[i],
                            item_name = dep_store$item_name[i],
                            item_id = dep_store$item_id[i],
                            brand = dep_store$brand[i],
                            prive = dep_store$price[i],
                            variation_id = dep_store$variation_id[i],
                            variation_dtails = dep_store$variation_details[i]
                             ))
}


cat(saveXML(xml_dep_store))

## <?xml version="1.0"?>
## 
## <Department Store>
##   <  category="Electronics" item_name="Smartphone" item_id="101" brand="TechBrand" prive="699.99" variation_id="101-A" variation_dtails="Color: Black, Storage: 64GB"/>
##   <  category="Electronics" item_name="Smartphone" item_id="101" brand="TechBrand" prive="699.99" variation_id="101-A" variation_dtails="Color: White, Storage: 128GB"/>
##   <  category="Electronics" item_name="Laptop" item_id="102" brand="CompuBrand" prive="1099.99" variation_id="102-A" variation_dtails="Color: Silver, Storage: 256GB"/>
##   <  category="Electronics" item_name="Laptop" item_id="102" brand="CompuBrand" prive="1099.99" variation_id="102-A" variation_dtails="Color: Space Gray, Storage: 512GB"/>
##   <  category="Home" item_name="Refrigerator" item_id="201" brand="HomeCool" prive="899.99" variation_id="201-A" variation_dtails="Stainless Steel, Capacity: 20 cu ft"/>
##   <  category="Home" item_name="Refrigerator" item_id="201" brand="HomeCool" prive="899.99" variation_id="201-B" variation_dtails="Color: White, Capacity: 18 cu ft"/>
##   <  category="Home" item_name="Washing Machine" item_id="202" brand="CleanTech" prive="499.99" variation_id="202-A" variation_dtails="Type: Front Load, Capacity: 4.5 cu ft"/>
##   <  category="Home" item_name="Washing Machine" item_id="202" brand="CleanTech" prive="499.99" variation_id="202-B" variation_dtails="Type: Top Load, Capacity: 5.0 cu ft"/>
##   <  category="Clothing" item_name="T-Shirt" item_id="301" brand="FashionCo" prive="19.99" variation_id="301-A" variation_dtails="Color: Blue, Size: S"/>
##   <  category="Clothing" item_name="T-Shirt" item_id="301" brand="FashionCo" prive="19.99" variation_id="301-B" variation_dtails="Color: Red, Size: M"/>
##   <  category="Clothing" item_name="T-Shirt" item_id="301" brand="FashionCo" prive="19.99" variation_id="301-B" variation_dtails="Color: Green, Size: L"/>
##   <  category="Clothing" item_name="Jeans" item_id="302" brand="DenimWorks" prive="49.99" variation_id="302-A" variation_dtails="Color: Dark Blue, Size: 32"/>
##   <  category="Clothing" item_name="Jeans" item_id="302" brand="DenimWorks" prive="49.99" variation_id="302-B" variation_dtails="Color: Light Blue, Size: 34"/>
##   <  category="Books" item_name="Fiction Novel" item_id="401" brand="-" prive="14.99" variation_id="401-A" variation_dtails="Format: Hardcover, Language: English"/>
##   <  category="Books" item_name="Fiction Novel" item_id="401" brand="-" prive="14.99" variation_id="401-B" variation_dtails="Format: Paperback, Language: Spanish"/>
##   <  category="Books" item_name="Non-Fiction Guide" item_id="402" brand="-" prive="24.99" variation_id="402-A" variation_dtails="Format: eBook, Language: English"/>
##   <  category="Books" item_name="Non-Fiction Guide" item_id="402" brand="-" prive="24.99" variation_id="402-B" variation_dtails="Format: Paperback, Language: French"/>
##   <  category="Sports Equipment" item_name="Basketball" item_id="501" brand="SportsGear" prive="29.99" variation_id="501-A" variation_dtails="Size: Size 7, Color: Orange"/>
##   <  category="Sports Equipment" item_name="Tennis Racket" item_id="502" brand="RacketPro" prive="89.99" variation_id="502-A" variation_dtails="Material: Graphite, Color: Black"/>
##   <  category="Sports Equipment" item_name="Tennis Racket" item_id="502" brand="RacketPro" prive="89.99" variation_id="502-B" variation_dtails="Material: Aluminum, Color: Silver"/>
## </Department Store>

saveXML(xml_dep_store, file="C:\\Users\\nakes\\OneDrive\\Desktop\\607\\data_dep_store.xml")

## [1] "C:\\Users\\nakes\\OneDrive\\Desktop\\607\\data_dep_store.xml"

Pros of XML: Each to understand by observation and used as a tool for storing and transporting data. Cons of XML: Difficult to convert dataframe to xml in R and requires more than one library to create desired XML format in R.

PARQUET

For PARQUET, I exported the data into a parquet file using arrow and write_parquet.

library(arrow)

## 
## Attaching package: 'arrow'

## The following object is masked from 'package:lubridate':
## 
##     duration

## The following object is masked from 'package:utils':
## 
##     timestamp

write_parquet(dep_store, "C:\\Users\\nakes\\OneDrive\\Desktop\\607\\dep_data.parquet")

Pros of PARQUET: Faster to read in data, easy to work with, and smaller in size compared to HTML. Cons of PARQUET: Difficult to understand raw file

Conclusion

Each data type: JSON, HTML, XML, and PARQUET is all useful for different purposes since they all have pros and cons. In the future, I would import the data in these different formats then try working with them in R.