For this assignment I had to prepare data that was provided from the CUNYMart inventory, which is located at located at 123 Example Street,Anytown, USA. To prepare the data for analysis, I had to turn the given data into JSon, HTML, XML, and Parquet files and then read them in R.
Loading packages
library(tidyverse)
library("arrow")
library(XML)
library(RCurl)
library(pdftools)
library(data.table)
library(jsonlite)
I created the data frame in R from the data frame we were provided for the assignment on CUNYMart’s inventory.
DF_prep<- data.frame(
Category = c("Electronics", "Electronics", "Electronics", "Electronics",
"Home Appliances", "Home Appliances", "Home Appliances", "Home Appliances",
"Clothing", "Clothing", "Clothing", "Clothing", "Clothing",
"Books", "Books", "Books", "Books",
"Sports Equipment", "Sports Equipment", "Sports Equipment"),
Item_Name = c("Smartphone", "Smartphone", "Laptop", "Laptop",
"Refrigerator", "Refrigerator", "Washing Machine", "Washing Machine",
"T-Shirt", "T-Shirt", "T-Shirt", "Jeans", "Jeans",
"Fiction Novel", "Fiction Novel", "Non-Fiction Guide", "Non-Fiction Guide",
"Basketball", "Tennis Racket", "Tennis Racket"),
Item_ID = c(101, 101, 102, 102,
201, 201, 202, 202,
301, 301, 301, 302, 302,
401, 401, 402, 402,
501, 502, 502),
Brand = c("TechBrand", "TechBrand", "CompuBrand", "CompuBrand",
"HomeCool", "HomeCool", "CleanTech", "CleanTech",
"FashionCo", "FashionCo", "FashionCo", "DenimWorks", "DenimWorks",
NA, NA, NA, NA,
"SportsGear", "RacketPro", "RacketPro"),
Price = c(699.99, 699.99, 1099.99, 1099.99,
899.99, 899.99, 499.99, 499.99,
19.99, 19.99, 19.99, 49.99, 49.99,
14.99, 14.99, 24.99, 24.99,
29.99, 89.99, 89.99),
Variation_ID = c("101-A", "101-B", "102-A", "102-B",
"201-A", "201-B", "202-A", "202-B",
"301-A", "301-B", "301-C", "302-A", "302-B",
"401-A", "401-B", "402-A", "402-B",
"501-A", "502-A", "502-B"),
Variation_Details = c("Color: Black, Storage: 64GB", "Color: White, Storage: 128GB",
"Color: Silver, Storage: 256GB", "Color: Space Gray, Storage: 512GB",
"Color: Stainless Steel, Capacity: 20 cu ft", "Color: White, Capacity: 18 cu ft",
"Type: Front Load, Capacity: 4.5 cu ft", "Type: Top Load, Capacity: 5.0 cu ft",
"Color: Blue, Size: S", "Color: Red, Size: M", "Color: Green, Size: L",
"Color: Dark Blue, Size: 32", "Color: Light Blue, Size: 34",
"Format: Hardcover, Language: English", "Format: Paperback, Language: Spanish",
"Format: eBook, Language: English", "Format: Paperback, Language: French",
"Size: Size 7, Color: Orange", "Material: Graphite, Color: Black", "Material: Aluminum, Color: Silver")
)
for each cell follow up the name of each column (ex.”th” Category “th”), for the following rows I used “td” as the tag for the data that belongs in the specific row and column, then to close the table I placed another “table” tag,and finally ended the file with “html”. I uploaded the html file into github, to obtain the raw data. I used getURL function to load the raw html file into R.**
html_DF<-getURL('https://raw.githubusercontent.com/Andreina-A/Assignment_7_Data607/refs/heads/main/dataframe.html')
cat(html_DF) # The view of the HTML code
## <html>
## <table>
## <tr>
## <th>Category</th>
## <th>Item Name</th>
## <th>Item ID</th>
## <th>Brand</th>
## <th>Price</th>
## <th>Variation ID</th>
## <th>Variation Details</th>
## </tr>
## <tr>
## <td>Electronics</td>
## <td>Smartphone</td>
## <td>101</td>
## <td>TechBrand</td>
## <td>699.99</td>
## <td>101-A</td>
## <td>Color: Black, Storage: 64GB</td>
## </tr>
## <tr>
## <td>Electronics</td>
## <td>Smartphone</td>
## <td>101</td>
## <td>TechBrand</td>
## <td>699.99</td>
## <td>101-B</td>
## <td>Color: White, Storage: 128GB</td>
## </tr>
## <tr>
## <td>Electronics</td>
## <td>Laptop</td>
## <td>102</td>
## <td>CompuBrand</td>
## <td>1099.99</td>
## <td>102-A</td>
## <td>Color: Silver, Storage: 256GB</td>
## </tr>
## <tr>
## <td>Electronics</td>
## <td>Laptop</td>
## <td>102</td>
## <td>CompuBrand</td>
## <td>1099.99</td>
## <td>102-B</td>
## <td>Color: Space Gray, Storage: 512GB</td>
## </tr>
## <tr>
## <td>Home Appliances</td>
## <td>Refrigerator</td>
## <td>201</td>
## <td>HomeCool</td>
## <td>899.99</td>
## <td>201-A</td>
## <td>Color: Stainless Steel, Capacity: 20 cu ft</td>
## </tr>
## <tr>
## <td>Home Appliances</td>
## <td>Refrigerator</td>
## <td>201</td>
## <td>HomeCool</td>
## <td>899.99</td>
## <td>201-B</td>
## <td>Color: White, Capacity: 18 cu ft</td>
## </tr>
## <tr>
## <td>Home Appliances</td>
## <td>Washing Machine</td>
## <td>202</td>
## <td>CleanTech</td>
## <td>499.99</td>
## <td>202-A</td>
## <td>Type: Front Load, Capacity: 4.5 cu ft</td>
## </tr>
## <tr>
## <td>Home Appliances</td>
## <td>Washing Machine</td>
## <td>202</td>
## <td>CleanTech</td>
## <td>499.99</td>
## <td>202-B</td>
## <td>Type: Top Load, Capacity: 5.0 cu ft</td>
## </tr>
## <tr>
## <td>Clothing</td>
## <td>T-Shirt</td>
## <td>301</td>
## <td>FashionCo</td>
## <td>19.99</td>
## <td>301-A</td>
## <td>Color: Blue, Size: S</td>
## </tr>
## <tr>
## <td>Clothing</td>
## <td>T-Shirt</td>
## <td>301</td>
## <td>FashionCo</td>
## <td>19.99</td>
## <td>301-B</td>
## <td>Color: Red, Size: M</td>
## </tr>
## <tr>
## <td>Clothing</td>
## <td>T-Shirt</td>
## <td>301</td>
## <td>FashionCo</td>
## <td>19.99</td>
## <td>301-C</td>
## <td>Color: Green, Size: L</td>
## </tr>
## <tr>
## <td>Clothing</td>
## <td>Jeans</td>
## <td>302</td>
## <td>DenimWorks</td>
## <td>49.99</td>
## <td>302-A</td>
## <td>Color: Dark Blue, Size: 32</td>
## </tr>
## <tr>
## <td>Clothing</td>
## <td>Jeans</td>
## <td>302</td>
## <td>DenimWorks</td>
## <td>49.99</td>
## <td>302-B</td>
## <td>Color: Light Blue, Size: 34</td>
## </tr>
## <tr>
## <td>Books</td>
## <td>Fiction Novel</td>
## <td>401</td>
## <td>-</td>
## <td>14.99</td>
## <td>401-A</td>
## <td>Format: Hardcover, Language: English</td>
## </tr>
## <tr>
## <td>Books</td>
## <td>Fiction Novel</td>
## <td>401</td>
## <td>-</td>
## <td>14.99</td>
## <td>401-B</td>
## <td>Format: Paperback, Language: Spanish</td>
## </tr>
## <tr>
## <td>Books</td>
## <td>Non-Fiction Guide</td>
## <td>402</td>
## <td>-</td>
## <td>24.99</td>
## <td>402-A</td>
## <td>Format: eBook, Language: English</td>
## </tr>
## <tr>
## <td>Books</td>
## <td>Non-Fiction Guide</td>
## <td>402</td>
## <td>-</td>
## <td>24.99</td>
## <td>402-B</td>
## <td>Format: Paperback, Language: French</td>
## </tr>
## <tr>
## <td>Sports Equipment</td>
## <td>Basketball</td>
## <td>501</td>
## <td>SportsGear</td>
## <td>29.99</td>
## <td>501-A</td>
## <td>Size: Size 7, Color: Orange</td>
## </tr>
## <tr>
## <td>Sports Equipment</td>
## <td>Tennis Racket</td>
## <td>502</td>
## <td>RacketPro</td>
## <td>89.99</td>
## <td>502-A</td>
## <td>Material: Graphite, Color: Black</td>
## </tr>
## <tr>
## <td>Sports Equipment</td>
## <td>Tennis Racket</td>
## <td>502</td>
## <td>RacketPro</td>
## <td>89.99</td>
## <td>502-B</td>
## <td>Material: Aluminum, Color: Silver</td>
## </tr>
## </table>
##
##
##
## </html>
To extract the data from the HTML file, I used readHTMLTable( a function from the XML package).
html_df <-readHTMLTable(html_DF,header=TRUE)
class(html_df)# Used calls function to see is the data extracted came out as a data frame, which it came out as a list instead.
## [1] "list"
To find the data frame, checked the class of the NULL object.
class(html_df$`NULL`)# I used NULL because it was the only generated option after I enter $
## [1] "data.frame"
I created the data frame variable with by extracted the NULL object from the HTML. Now the data is ready to use in vairbale HTML_dataframe
html_dataframe<-html_df$`NULL`
html_dataframe
## Category Item Name Item ID Brand Price Variation ID
## 1 Electronics Smartphone 101 TechBrand 699.99 101-A
## 2 Electronics Smartphone 101 TechBrand 699.99 101-B
## 3 Electronics Laptop 102 CompuBrand 1099.99 102-A
## 4 Electronics Laptop 102 CompuBrand 1099.99 102-B
## 5 Home Appliances Refrigerator 201 HomeCool 899.99 201-A
## 6 Home Appliances Refrigerator 201 HomeCool 899.99 201-B
## 7 Home Appliances Washing Machine 202 CleanTech 499.99 202-A
## 8 Home Appliances Washing Machine 202 CleanTech 499.99 202-B
## 9 Clothing T-Shirt 301 FashionCo 19.99 301-A
## 10 Clothing T-Shirt 301 FashionCo 19.99 301-B
## 11 Clothing T-Shirt 301 FashionCo 19.99 301-C
## 12 Clothing Jeans 302 DenimWorks 49.99 302-A
## 13 Clothing Jeans 302 DenimWorks 49.99 302-B
## 14 Books Fiction Novel 401 - 14.99 401-A
## 15 Books Fiction Novel 401 - 14.99 401-B
## 16 Books Non-Fiction Guide 402 - 24.99 402-A
## 17 Books Non-Fiction Guide 402 - 24.99 402-B
## 18 Sports Equipment Basketball 501 SportsGear 29.99 501-A
## 19 Sports Equipment Tennis Racket 502 RacketPro 89.99 502-A
## 20 Sports Equipment Tennis Racket 502 RacketPro 89.99 502-B
## Variation Details
## 1 Color: Black, Storage: 64GB
## 2 Color: White, Storage: 128GB
## 3 Color: Silver, Storage: 256GB
## 4 Color: Space Gray, Storage: 512GB
## 5 Color: Stainless Steel, Capacity: 20 cu ft
## 6 Color: White, Capacity: 18 cu ft
## 7 Type: Front Load, Capacity: 4.5 cu ft
## 8 Type: Top Load, Capacity: 5.0 cu ft
## 9 Color: Blue, Size: S
## 10 Color: Red, Size: M
## 11 Color: Green, Size: L
## 12 Color: Dark Blue, Size: 32
## 13 Color: Light Blue, Size: 34
## 14 Format: Hardcover, Language: English
## 15 Format: Paperback, Language: Spanish
## 16 Format: eBook, Language: English
## 17 Format: Paperback, Language: French
## 18 Size: Size 7, Color: Orange
## 19 Material: Graphite, Color: Black
## 20 Material: Aluminum, Color: Silver
XML was manually created similarly to HTML, just with a difference in tags which are more straightforward, the column title themselves were used in the tags.
DF_XML<-getURL('https://raw.githubusercontent.com/Andreina-A/Assignment_7_Data607/refs/heads/main/Data_xml.xml')
cat(DF_XML)#XML code
## <?xml version="1.0" encoding="UTF-8" ?>
## <Items>
## <Item>
## <Category>Electronics</Category>
## <ItemName>Smartphone</ItemName>
## <ItemID>101</ItemID>
## <Brand>TechBrand</Brand>
## <Price>699.99</Price>
## <VariationID>101-A</VariationID>
## <VariationDetails>
## <Color>Black</Color>
## <Storage>64GB</Storage>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Electronics</Category>
## <ItemName>Smartphone</ItemName>
## <ItemID>101</ItemID>
## <Brand>TechBrand</Brand>
## <Price>699.99</Price>
## <VariationID>101-B</VariationID>
## <VariationDetails>
## <Color>White</Color>
## <Storage>128GB</Storage>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Electronics</Category>
## <ItemName>Laptop</ItemName>
## <ItemID>102</ItemID>
## <Brand>CompuBrand</Brand>
## <Price>1099.99</Price>
## <VariationID>102-A</VariationID>
## <VariationDetails>
## <Color>Silver</Color>
## <Storage>256GB</Storage>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Electronics</Category>
## <ItemName>Laptop</ItemName>
## <ItemID>102</ItemID>
## <Brand>CompuBrand</Brand>
## <Price>1099.99</Price>
## <VariationID>102-B</VariationID>
## <VariationDetails>
## <Color>Space Gray</Color>
## <Storage>512GB</Storage>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Home Appliances</Category>
## <ItemName>Refrigerator</ItemName>
## <ItemID>201</ItemID>
## <Brand>HomeCool</Brand>
## <Price>899.99</Price>
## <VariationID>201-A</VariationID>
## <VariationDetails>
## <Color>Stainless Steel</Color>
## <Capacity>20 cu ft</Capacity>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Home Appliances</Category>
## <ItemName>Refrigerator</ItemName>
## <ItemID>201</ItemID>
## <Brand>HomeCool</Brand>
## <Price>899.99</Price>
## <VariationID>201-B</VariationID>
## <VariationDetails>
## <Color>White</Color>
## <Capacity>18 cu ft</Capacity>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Home Appliances</Category>
## <ItemName>Washing Machine</ItemName>
## <ItemID>202</ItemID>
## <Brand>CleanTech</Brand>
## <Price>499.99</Price>
## <VariationID>202-A</VariationID>
## <VariationDetails>
## <Type>Front Load</Type>
## <Capacity>4.5 cu ft</Capacity>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Home Appliances</Category>
## <ItemName>Washing Machine</ItemName>
## <ItemID>202</ItemID>
## <Brand>CleanTech</Brand>
## <Price>499.99</Price>
## <VariationID>202-B</VariationID>
## <VariationDetails>
## <Type>Top Load</Type>
## <Capacity>5.0 cu ft</Capacity>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Clothing</Category>
## <ItemName>T-Shirt</ItemName>
## <ItemID>301</ItemID>
## <Brand>FashionCo</Brand>
## <Price>19.99</Price>
## <VariationID>301-A</VariationID>
## <VariationDetails>
## <Color>Blue</Color>
## <Size>S</Size>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Clothing</Category>
## <ItemName>T-Shirt</ItemName>
## <ItemID>301</ItemID>
## <Brand>FashionCo</Brand>
## <Price>19.99</Price>
## <VariationID>301-B</VariationID>
## <VariationDetails>
## <Color>Red</Color>
## <Size>M</Size>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Clothing</Category>
## <ItemName>T-Shirt</ItemName>
## <ItemID>301</ItemID>
## <Brand>FashionCo</Brand>
## <Price>19.99</Price>
## <VariationID>301-C</VariationID>
## <VariationDetails>
## <Color>Green</Color>
## <Size>L</Size>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Clothing</Category>
## <ItemName>Jeans</ItemName>
## <ItemID>302</ItemID>
## <Brand>DenimWorks</Brand>
## <Price>49.99</Price>
## <VariationID>302-A</VariationID>
## <VariationDetails>
## <Color>Dark Blue</Color>
## <Size>32</Size>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Clothing</Category>
## <ItemName>Jeans</ItemName>
## <ItemID>302</ItemID>
## <Brand>DenimWorks</Brand>
## <Price>49.99</Price>
## <VariationID>302-B</VariationID>
## <VariationDetails>
## <Color>Light Blue</Color>
## <Size>34</Size>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Books</Category>
## <ItemName>Fiction Novel</ItemName>
## <ItemID>401</ItemID>
## <Brand>-</Brand>
## <Price>14.99</Price>
## <VariationID>401-A</VariationID>
## <VariationDetails>
## <Format>Hardcover</Format>
## <Language>English</Language>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Books</Category>
## <ItemName>Fiction Novel</ItemName>
## <ItemID>401</ItemID>
## <Brand>-</Brand>
## <Price>14.99</Price>
## <VariationID>401-B</VariationID>
## <VariationDetails>
## <Format>Paperback</Format>
## <Language>Spanish</Language>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Books</Category>
## <ItemName>Non-Fiction Guide</ItemName>
## <ItemID>402</ItemID>
## <Brand>-</Brand>
## <Price>24.99</Price>
## <VariationID>402-A</VariationID>
## <VariationDetails>
## <Format>eBook</Format>
## <Language>English</Language>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Books</Category>
## <ItemName>Non-Fiction Guide</ItemName>
## <ItemID>402</ItemID>
## <Brand>-</Brand>
## <Price>24.99</Price>
## <VariationID>402-B</VariationID>
## <VariationDetails>
## <Format>Paperback</Format>
## <Language>French</Language>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Sports Equipment</Category>
## <ItemName>Basketball</ItemName>
## <ItemID>501</ItemID>
## <Brand>SportsGear</Brand>
## <Price>29.99</Price>
## <VariationID>501-A</VariationID>
## <VariationDetails>
## <Size>Size 7</Size>
## <Color>Orange</Color>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Sports Equipment</Category>
## <ItemName>Tennis Racket</ItemName>
## <ItemID>502</ItemID>
## <Brand>RacketPro</Brand>
## <Price>89.99</Price>
## <VariationID>502-A</VariationID>
## <VariationDetails>
## <Material>Graphite</Material>
## <Color>Black</Color>
## </VariationDetails>
## </Item>
## <Item>
## <Category>Sports Equipment</Category>
## <ItemName>Tennis Racket</ItemName>
## <ItemID>502</ItemID>
## <Brand>RacketPro</Brand>
## <Price>89.99</Price>
## <VariationID>502-B</VariationID>
## <VariationDetails>
## <Material>Aluminum</Material>
## <Color>Silver</Color>
## </VariationDetails>
## </Item>
## </Items>
Used the class function to see if the XML file will readily show as a data frame
class(DF_XML)# output was a character
## [1] "character"
To obtain the data frame from the XML file I used the xmlToDataFrame function, now the data is ready to be used for analysis.
XML_df <-xmlToDataFrame(DF_XML)
XML_df
## Category ItemName ItemID Brand Price VariationID
## 1 Electronics Smartphone 101 TechBrand 699.99 101-A
## 2 Electronics Smartphone 101 TechBrand 699.99 101-B
## 3 Electronics Laptop 102 CompuBrand 1099.99 102-A
## 4 Electronics Laptop 102 CompuBrand 1099.99 102-B
## 5 Home Appliances Refrigerator 201 HomeCool 899.99 201-A
## 6 Home Appliances Refrigerator 201 HomeCool 899.99 201-B
## 7 Home Appliances Washing Machine 202 CleanTech 499.99 202-A
## 8 Home Appliances Washing Machine 202 CleanTech 499.99 202-B
## 9 Clothing T-Shirt 301 FashionCo 19.99 301-A
## 10 Clothing T-Shirt 301 FashionCo 19.99 301-B
## 11 Clothing T-Shirt 301 FashionCo 19.99 301-C
## 12 Clothing Jeans 302 DenimWorks 49.99 302-A
## 13 Clothing Jeans 302 DenimWorks 49.99 302-B
## 14 Books Fiction Novel 401 - 14.99 401-A
## 15 Books Fiction Novel 401 - 14.99 401-B
## 16 Books Non-Fiction Guide 402 - 24.99 402-A
## 17 Books Non-Fiction Guide 402 - 24.99 402-B
## 18 Sports Equipment Basketball 501 SportsGear 29.99 501-A
## 19 Sports Equipment Tennis Racket 502 RacketPro 89.99 502-A
## 20 Sports Equipment Tennis Racket 502 RacketPro 89.99 502-B
## VariationDetails
## 1 Black64GB
## 2 White128GB
## 3 Silver256GB
## 4 Space Gray512GB
## 5 Stainless Steel20 cu ft
## 6 White18 cu ft
## 7 Front Load4.5 cu ft
## 8 Top Load5.0 cu ft
## 9 BlueS
## 10 RedM
## 11 GreenL
## 12 Dark Blue32
## 13 Light Blue34
## 14 HardcoverEnglish
## 15 PaperbackSpanish
## 16 eBookEnglish
## 17 PaperbackFrench
## 18 Size 7Orange
## 19 GraphiteBlack
## 20 AluminumSilver
In order to get json code I used the data frame I created in R, using the toJSON function. Once the code was generated I created a json file using the texteditor application, and lastly I uploaded the file into github. The json codes reminds me of a python dictionary, which seems easy to code manually and easy to read.
Df_Json <- toJSON(DF_prep, pretty=TRUE)
cat(Df_Json)
## [
## {
## "Category": "Electronics",
## "Item_Name": "Smartphone",
## "Item_ID": 101,
## "Brand": "TechBrand",
## "Price": 699.99,
## "Variation_ID": "101-A",
## "Variation_Details": "Color: Black, Storage: 64GB"
## },
## {
## "Category": "Electronics",
## "Item_Name": "Smartphone",
## "Item_ID": 101,
## "Brand": "TechBrand",
## "Price": 699.99,
## "Variation_ID": "101-B",
## "Variation_Details": "Color: White, Storage: 128GB"
## },
## {
## "Category": "Electronics",
## "Item_Name": "Laptop",
## "Item_ID": 102,
## "Brand": "CompuBrand",
## "Price": 1099.99,
## "Variation_ID": "102-A",
## "Variation_Details": "Color: Silver, Storage: 256GB"
## },
## {
## "Category": "Electronics",
## "Item_Name": "Laptop",
## "Item_ID": 102,
## "Brand": "CompuBrand",
## "Price": 1099.99,
## "Variation_ID": "102-B",
## "Variation_Details": "Color: Space Gray, Storage: 512GB"
## },
## {
## "Category": "Home Appliances",
## "Item_Name": "Refrigerator",
## "Item_ID": 201,
## "Brand": "HomeCool",
## "Price": 899.99,
## "Variation_ID": "201-A",
## "Variation_Details": "Color: Stainless Steel, Capacity: 20 cu ft"
## },
## {
## "Category": "Home Appliances",
## "Item_Name": "Refrigerator",
## "Item_ID": 201,
## "Brand": "HomeCool",
## "Price": 899.99,
## "Variation_ID": "201-B",
## "Variation_Details": "Color: White, Capacity: 18 cu ft"
## },
## {
## "Category": "Home Appliances",
## "Item_Name": "Washing Machine",
## "Item_ID": 202,
## "Brand": "CleanTech",
## "Price": 499.99,
## "Variation_ID": "202-A",
## "Variation_Details": "Type: Front Load, Capacity: 4.5 cu ft"
## },
## {
## "Category": "Home Appliances",
## "Item_Name": "Washing Machine",
## "Item_ID": 202,
## "Brand": "CleanTech",
## "Price": 499.99,
## "Variation_ID": "202-B",
## "Variation_Details": "Type: Top Load, Capacity: 5.0 cu ft"
## },
## {
## "Category": "Clothing",
## "Item_Name": "T-Shirt",
## "Item_ID": 301,
## "Brand": "FashionCo",
## "Price": 19.99,
## "Variation_ID": "301-A",
## "Variation_Details": "Color: Blue, Size: S"
## },
## {
## "Category": "Clothing",
## "Item_Name": "T-Shirt",
## "Item_ID": 301,
## "Brand": "FashionCo",
## "Price": 19.99,
## "Variation_ID": "301-B",
## "Variation_Details": "Color: Red, Size: M"
## },
## {
## "Category": "Clothing",
## "Item_Name": "T-Shirt",
## "Item_ID": 301,
## "Brand": "FashionCo",
## "Price": 19.99,
## "Variation_ID": "301-C",
## "Variation_Details": "Color: Green, Size: L"
## },
## {
## "Category": "Clothing",
## "Item_Name": "Jeans",
## "Item_ID": 302,
## "Brand": "DenimWorks",
## "Price": 49.99,
## "Variation_ID": "302-A",
## "Variation_Details": "Color: Dark Blue, Size: 32"
## },
## {
## "Category": "Clothing",
## "Item_Name": "Jeans",
## "Item_ID": 302,
## "Brand": "DenimWorks",
## "Price": 49.99,
## "Variation_ID": "302-B",
## "Variation_Details": "Color: Light Blue, Size: 34"
## },
## {
## "Category": "Books",
## "Item_Name": "Fiction Novel",
## "Item_ID": 401,
## "Price": 14.99,
## "Variation_ID": "401-A",
## "Variation_Details": "Format: Hardcover, Language: English"
## },
## {
## "Category": "Books",
## "Item_Name": "Fiction Novel",
## "Item_ID": 401,
## "Price": 14.99,
## "Variation_ID": "401-B",
## "Variation_Details": "Format: Paperback, Language: Spanish"
## },
## {
## "Category": "Books",
## "Item_Name": "Non-Fiction Guide",
## "Item_ID": 402,
## "Price": 24.99,
## "Variation_ID": "402-A",
## "Variation_Details": "Format: eBook, Language: English"
## },
## {
## "Category": "Books",
## "Item_Name": "Non-Fiction Guide",
## "Item_ID": 402,
## "Price": 24.99,
## "Variation_ID": "402-B",
## "Variation_Details": "Format: Paperback, Language: French"
## },
## {
## "Category": "Sports Equipment",
## "Item_Name": "Basketball",
## "Item_ID": 501,
## "Brand": "SportsGear",
## "Price": 29.99,
## "Variation_ID": "501-A",
## "Variation_Details": "Size: Size 7, Color: Orange"
## },
## {
## "Category": "Sports Equipment",
## "Item_Name": "Tennis Racket",
## "Item_ID": 502,
## "Brand": "RacketPro",
## "Price": 89.99,
## "Variation_ID": "502-A",
## "Variation_Details": "Material: Graphite, Color: Black"
## },
## {
## "Category": "Sports Equipment",
## "Item_Name": "Tennis Racket",
## "Item_ID": 502,
## "Brand": "RacketPro",
## "Price": 89.99,
## "Variation_ID": "502-B",
## "Variation_Details": "Material: Aluminum, Color: Silver"
## }
## ]
Now that the file is created, I imported file from the github url using the fromjson function in R, which will parse the file back into a data frame.
Json_df<-fromJSON("https://raw.githubusercontent.com/Andreina-A/Assignment_7_Data607/refs/heads/main/json_df.json")
class(Json_df)
## [1] "data.frame"
**Using the data.table function I will create a variable called json _df to have the table of the json file. Now the json data dataframe is ready to be used for analysis.**
Json_df<-data.table(Json_df)
Json_df
## Category Item_Name Item_ID Brand Price Variation_ID
## <char> <char> <int> <char> <num> <char>
## 1: Electronics Smartphone 101 TechBrand 699.99 101-A
## 2: Electronics Smartphone 101 TechBrand 699.99 101-B
## 3: Electronics Laptop 102 CompuBrand 1099.99 102-A
## 4: Electronics Laptop 102 CompuBrand 1099.99 102-B
## 5: Home Appliances Refrigerator 201 HomeCool 899.99 201-A
## 6: Home Appliances Refrigerator 201 HomeCool 899.99 201-B
## 7: Home Appliances Washing Machine 202 CleanTech 499.99 202-A
## 8: Home Appliances Washing Machine 202 CleanTech 499.99 202-B
## 9: Clothing T-Shirt 301 FashionCo 19.99 301-A
## 10: Clothing T-Shirt 301 FashionCo 19.99 301-B
## 11: Clothing T-Shirt 301 FashionCo 19.99 301-C
## 12: Clothing Jeans 302 DenimWorks 49.99 302-A
## 13: Clothing Jeans 302 DenimWorks 49.99 302-B
## 14: Books Fiction Novel 401 <NA> 14.99 401-A
## 15: Books Fiction Novel 401 <NA> 14.99 401-B
## 16: Books Non-Fiction Guide 402 <NA> 24.99 402-A
## 17: Books Non-Fiction Guide 402 <NA> 24.99 402-B
## 18: Sports Equipment Basketball 501 SportsGear 29.99 501-A
## 19: Sports Equipment Tennis Racket 502 RacketPro 89.99 502-A
## 20: Sports Equipment Tennis Racket 502 RacketPro 89.99 502-B
## Category Item_Name Item_ID Brand Price Variation_ID
## Variation_Details
## <char>
## 1: Color: Black, Storage: 64GB
## 2: Color: White, Storage: 128GB
## 3: Color: Silver, Storage: 256GB
## 4: Color: Space Gray, Storage: 512GB
## 5: Color: Stainless Steel, Capacity: 20 cu ft
## 6: Color: White, Capacity: 18 cu ft
## 7: Type: Front Load, Capacity: 4.5 cu ft
## 8: Type: Top Load, Capacity: 5.0 cu ft
## 9: Color: Blue, Size: S
## 10: Color: Red, Size: M
## 11: Color: Green, Size: L
## 12: Color: Dark Blue, Size: 32
## 13: Color: Light Blue, Size: 34
## 14: Format: Hardcover, Language: English
## 15: Format: Paperback, Language: Spanish
## 16: Format: eBook, Language: English
## 17: Format: Paperback, Language: French
## 18: Size: Size 7, Color: Orange
## 19: Material: Graphite, Color: Black
## 20: Material: Aluminum, Color: Silver
## Variation_Details
To create a parquet file I used the data frame I created in the beginning of this R markdown, to write a parquet file to save into my computer. I didn’t used the github to import this file becasuwe github wasn’t able to read the file, instead it just downloaded to file back to my computer.
write_parquet(DF_prep, "CUNYMart_Inventory.parquet")
#read parquet file
To read the parquet file I used the read_parquet function (which is from the “arrow” package)
Parquet_df<-read_parquet("CUNYMart_Inventory.parquet")
class(Parquet_df) #class function showed that Parquet_df is a data frame table
## [1] "tbl_df" "tbl" "data.frame"
head(Parquet_df)
## # A tibble: 6 × 7
## Category Item_Name Item_ID Brand Price Variation_ID Variation_Details
## <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
## 1 Electronics Smartphone 101 Tech… 700. 101-A Color: Black, St…
## 2 Electronics Smartphone 101 Tech… 700. 101-B Color: White, St…
## 3 Electronics Laptop 102 Comp… 1100. 102-A Color: Silver, S…
## 4 Electronics Laptop 102 Comp… 1100. 102-B Color: Space Gra…
## 5 Home Appliances Refrigerat… 201 Home… 900. 201-A Color: Stainless…
## 6 Home Appliances Refrigerat… 201 Home… 900. 201-B Color: White, Ca…
HTML, XML, and Json are file formats that can be created with coding in which are human readable, while Parquet requires a machine such as a R library in order to read and write than the Parquet file. I wasn’t able to load the Parquet file into github to see the coding if there is any. In my opinion Json would be the easiest file to write because you have to use the tags nor a library like parquet to create a file. I would as parquet would have less chance of mistakes as you will be creating the data frame first in a library which you can readily fix any errors without brackets or tags being in the way, and then you just create the parquet file. When comparing files sizes from descending order XML had the highest size of 7.13KB, HTML was 4.49KB, parquet was 4.37KB, and Json had a size of 4.3KB. I would say XML is has tags that are easier to understand than HTML, but XML taks up alot of space. Overall, I would prefer workign with Json files as they are space efficent and easier to write and read.