library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The below code block loads A csv file into the r workspace as a dataframe. Since there are only 7 column headers in the first row and there are 8 items separated by commas in all subsequent rows loading the dataframe causes an error. It loads the final two features of the csv to a single column.
cuny_mart_raw <- read_csv("https://raw.githubusercontent.com/mraynolds/data_607/refs/heads/main/cuny_mart.csv", show_col_types = FALSE)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
problems(cuny_mart_raw)
## # A tibble: 20 × 5
## row col expected actual file
## <int> <int> <chr> <chr> <chr>
## 1 2 8 7 columns 8 columns ""
## 2 3 8 7 columns 8 columns ""
## 3 4 8 7 columns 8 columns ""
## 4 5 8 7 columns 8 columns ""
## 5 6 8 7 columns 8 columns ""
## 6 7 8 7 columns 8 columns ""
## 7 8 8 7 columns 8 columns ""
## 8 9 8 7 columns 8 columns ""
## 9 10 8 7 columns 8 columns ""
## 10 11 8 7 columns 8 columns ""
## 11 12 8 7 columns 8 columns ""
## 12 13 8 7 columns 8 columns ""
## 13 14 8 7 columns 8 columns ""
## 14 15 8 7 columns 8 columns ""
## 15 16 8 7 columns 8 columns ""
## 16 17 8 7 columns 8 columns ""
## 17 18 8 7 columns 8 columns ""
## 18 19 8 7 columns 8 columns ""
## 19 20 8 7 columns 8 columns ""
## 20 21 8 7 columns 8 columns ""
head(cuny_mart_raw)
## # A tibble: 6 × 7
## Category `Item Name` `Item ID` Brand Price `Variation ID` `Variation Details`
## <chr> <chr> <dbl> <chr> <dbl> <chr> <chr>
## 1 Electron… Smartphone 101 Tech… 700. 101-A Color: Black, Stor…
## 2 Electron… Smartphone 101 Tech… 700. 101-B Color: White, Stor…
## 3 Electron… Laptop 102 Comp… 1100. 102-A Color: Silver, Sto…
## 4 Electron… Laptop 102 Comp… 1100. 102-B Color: Space Gray,…
## 5 Home App… Refrigerat… 201 Home… 900. 201-A Color: Stainless S…
## 6 Home App… Refrigerat… 201 Home… 900. 201-B Color: White, Capa…
The following code separates out the data in the Variation Details into separate columns and then separating the key and item pairings. Finally it pivots the data wider so that each variation has its own column.
cuny_mart <- cuny_mart_raw |>
separate_wider_delim(`Variation Details`, delim = ", ", names = c("variation_details", "variation_details_2")) |>
separate_wider_delim(variation_details, delim = ": ", names = c("variation", "variation_detail")) |>
separate_wider_delim(variation_details_2, delim = ": ", names = c("variation_2", "variation_detail_2")) |>
pivot_wider(names_from = variation, values_from = variation_detail) |>
pivot_wider(names_from = variation_2, values_from = variation_detail_2, names_repair = "unique")
## New names:
## • `Color` -> `Color...7`
## • `Size` -> `Size...10`
## • `Size` -> `Size...14`
## • `Color` -> `Color...16`
head(cuny_mart)
## # A tibble: 6 × 16
## Category `Item Name` `Item ID` Brand Price `Variation ID` Color...7 Type
## <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 Electronics Smartphone 101 Tech… 700. 101-A Black <NA>
## 2 Electronics Smartphone 101 Tech… 700. 101-B White <NA>
## 3 Electronics Laptop 102 Comp… 1100. 102-A Silver <NA>
## 4 Electronics Laptop 102 Comp… 1100. 102-B Space Gr… <NA>
## 5 Home Applian… Refrigerat… 201 Home… 900. 201-A Stainles… <NA>
## 6 Home Applian… Refrigerat… 201 Home… 900. 201-B White <NA>
## # ℹ 8 more variables: Format <chr>, Size...10 <chr>, Material <chr>,
## # Storage <chr>, Capacity <chr>, Size...14 <chr>, Language <chr>,
## # Color...16 <chr>
The following code cleans up the original data and renames columns.
cuny_mart <- cuny_mart |>
unite(color, c(Color...7, Color...16), na.rm = TRUE) |>
unite(size, c(Size...10, Size...14), na.rm = TRUE) |>
rename(
category = Category,
item_name = `Item Name`,
item_id = `Item ID`,
brand = Brand,
price = Price,
variation_id = `Variation ID`,
type = Type,
format = Format,
material = Material,
storage = Storage,
capactiy = Capacity,
language = Language
)
head(cuny_mart)
## # A tibble: 6 × 14
## category item_name item_id brand price variation_id color type format size
## <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 Electroni… Smartpho… 101 Tech… 700. 101-A Black <NA> <NA> ""
## 2 Electroni… Smartpho… 101 Tech… 700. 101-B White <NA> <NA> ""
## 3 Electroni… Laptop 102 Comp… 1100. 102-A Silv… <NA> <NA> ""
## 4 Electroni… Laptop 102 Comp… 1100. 102-B Spac… <NA> <NA> ""
## 5 Home Appl… Refriger… 201 Home… 900. 201-A Stai… <NA> <NA> ""
## 6 Home Appl… Refriger… 201 Home… 900. 201-B White <NA> <NA> ""
## # ℹ 4 more variables: material <chr>, storage <chr>, capactiy <chr>,
## # language <chr>
The following code loads packages to use for parquet.
library(arrow)
##
## Attaching package: 'arrow'
## The following object is masked from 'package:lubridate':
##
## duration
## The following object is masked from 'package:utils':
##
## timestamp
library(dplyr, warn.conflicts = FALSE)
The following code block exports the data from the csv to parquet format. Parquet has a significant advantage over other methods as it stores data in smaller file sizes than other formats. It stores data in columns and allows for larger files to be stored and loaded faster. The drawback is the format is no longer human readable. It is also not efficient for small datasets.
pqpath <- "assignment_7"
write_dataset(cuny_mart, path = pqpath, format = "parquet")
The following code displays the first 100 bit of the raw parquet file. It also reads the file directly from github.
readBin("https://raw.githubusercontent.com/mraynolds/data_607/main/part-0.parquet", what = "raw", n = 100)
## [1] 50 41 52 31 15 04 15 96 01 15 9c 01 4c 15 0a 15 00 12 00 00 4b f0 4a 0b 00
## [26] 00 00 45 6c 65 63 74 72 6f 6e 69 63 73 0f 00 00 00 48 6f 6d 65 20 41 70 70
## [51] 6c 69 61 6e 63 65 73 08 00 00 00 43 6c 6f 74 68 69 6e 67 05 00 00 00 42 6f
## [76] 6f 6b 73 10 00 00 00 53 70 6f 72 74 73 20 45 71 75 69 70 6d 65 6e 74 15 00
read_parquet("https://raw.githubusercontent.com/mraynolds/data_607/main/part-0.parquet", as_data_frame = TRUE)
## # A tibble: 20 × 14
## category item_name item_id brand price variation_id color type format size
## <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 Electro… Smartpho… 101 Tech… 700. 101-A "Bla… <NA> <NA> ""
## 2 Electro… Smartpho… 101 Tech… 700. 101-B "Whi… <NA> <NA> ""
## 3 Electro… Laptop 102 Comp… 1100. 102-A "Sil… <NA> <NA> ""
## 4 Electro… Laptop 102 Comp… 1100. 102-B "Spa… <NA> <NA> ""
## 5 Home Ap… Refriger… 201 Home… 900. 201-A "Sta… <NA> <NA> ""
## 6 Home Ap… Refriger… 201 Home… 900. 201-B "Whi… <NA> <NA> ""
## 7 Home Ap… Washing … 202 Clea… 500. 202-A "" Fron… <NA> ""
## 8 Home Ap… Washing … 202 Clea… 500. 202-B "" Top … <NA> ""
## 9 Clothing T-Shirt 301 Fash… 20.0 301-A "Blu… <NA> <NA> "S"
## 10 Clothing T-Shirt 301 Fash… 20.0 301-B "Red" <NA> <NA> "M"
## 11 Clothing T-Shirt 301 Fash… 20.0 301-C "Gre… <NA> <NA> "L"
## 12 Clothing Jeans 302 Deni… 50.0 302-A "Dar… <NA> <NA> "32"
## 13 Clothing Jeans 302 Deni… 50.0 302-B "Lig… <NA> <NA> "34"
## 14 Books Fiction … 401 - 15.0 401-A "" <NA> Hardc… ""
## 15 Books Fiction … 401 - 15.0 401-B "" <NA> Paper… ""
## 16 Books Non-Fict… 402 - 25.0 402-A "" <NA> eBook ""
## 17 Books Non-Fict… 402 - 25.0 402-B "" <NA> Paper… ""
## 18 Sports … Basketba… 501 Spor… 30.0 501-A "Ora… <NA> <NA> "Siz…
## 19 Sports … Tennis R… 502 Rack… 90.0 502-A "Bla… <NA> <NA> ""
## 20 Sports … Tennis R… 502 Rack… 90.0 502-B "Sil… <NA> <NA> ""
## # ℹ 4 more variables: material <chr>, storage <chr>, capactiy <chr>,
## # language <chr>
The following code imports a parquet file into the r workspace and collects it for use in a dataframe.
cuny_mart_parquet <- read_parquet("https://raw.githubusercontent.com/mraynolds/data_607/main/part-0.parquet", as_data_frame = TRUE)
head(cuny_mart_parquet)
## # A tibble: 6 × 14
## category item_name item_id brand price variation_id color type format size
## <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 Electroni… Smartpho… 101 Tech… 700. 101-A Black <NA> <NA> ""
## 2 Electroni… Smartpho… 101 Tech… 700. 101-B White <NA> <NA> ""
## 3 Electroni… Laptop 102 Comp… 1100. 102-A Silv… <NA> <NA> ""
## 4 Electroni… Laptop 102 Comp… 1100. 102-B Spac… <NA> <NA> ""
## 5 Home Appl… Refriger… 201 Home… 900. 201-A Stai… <NA> <NA> ""
## 6 Home Appl… Refriger… 201 Home… 900. 201-B White <NA> <NA> ""
## # ℹ 4 more variables: material <chr>, storage <chr>, capactiy <chr>,
## # language <chr>
A small bit of analysis: The following code displays the average price by category from the imported dataset.
cuny_mart_parquet |> group_by(category) |> summarize(
price = mean(price)
)
## # A tibble: 5 × 2
## category price
## <chr> <dbl>
## 1 Books 20.0
## 2 Clothing 32.0
## 3 Electronics 900.
## 4 Home Appliances 700.
## 5 Sports Equipment 70.0
library(jsonlite)
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
The following code exports the original dataframe to JSON format and displays the raw JSON file.
cuny_mart_json_export <- toJSON(cuny_mart)
write(cuny_mart_json_export, "cuny_mart_json_export.json")
print(cuny_mart_json_export)
## [{"category":"Electronics","item_name":"Smartphone","item_id":101,"brand":"TechBrand","price":699.99,"variation_id":"101-A","color":"Black","size":"","storage":"64GB"},{"category":"Electronics","item_name":"Smartphone","item_id":101,"brand":"TechBrand","price":699.99,"variation_id":"101-B","color":"White","size":"","storage":"128GB"},{"category":"Electronics","item_name":"Laptop","item_id":102,"brand":"CompuBrand","price":1099.99,"variation_id":"102-A","color":"Silver","size":"","storage":"256GB"},{"category":"Electronics","item_name":"Laptop","item_id":102,"brand":"CompuBrand","price":1099.99,"variation_id":"102-B","color":"Space Gray","size":"","storage":"512GB"},{"category":"Home Appliances","item_name":"Refrigerator","item_id":201,"brand":"HomeCool","price":899.99,"variation_id":"201-A","color":"Stainless Steel","size":"","capactiy":"20 cu ft"},{"category":"Home Appliances","item_name":"Refrigerator","item_id":201,"brand":"HomeCool","price":899.99,"variation_id":"201-B","color":"White","size":"","capactiy":"18 cu ft"},{"category":"Home Appliances","item_name":"Washing Machine","item_id":202,"brand":"CleanTech","price":499.99,"variation_id":"202-A","color":"","type":"Front Load","size":"","capactiy":"4.5 cu ft"},{"category":"Home Appliances","item_name":"Washing Machine","item_id":202,"brand":"CleanTech","price":499.99,"variation_id":"202-B","color":"","type":"Top Load","size":"","capactiy":"5.0 cu ft"},{"category":"Clothing","item_name":"T-Shirt","item_id":301,"brand":"FashionCo","price":19.99,"variation_id":"301-A","color":"Blue","size":"S"},{"category":"Clothing","item_name":"T-Shirt","item_id":301,"brand":"FashionCo","price":19.99,"variation_id":"301-B","color":"Red","size":"M"},{"category":"Clothing","item_name":"T-Shirt","item_id":301,"brand":"FashionCo","price":19.99,"variation_id":"301-C","color":"Green","size":"L"},{"category":"Clothing","item_name":"Jeans","item_id":302,"brand":"DenimWorks","price":49.99,"variation_id":"302-A","color":"Dark Blue","size":"32"},{"category":"Clothing","item_name":"Jeans","item_id":302,"brand":"DenimWorks","price":49.99,"variation_id":"302-B","color":"Light Blue","size":"34"},{"category":"Books","item_name":"Fiction Novel","item_id":401,"brand":"-","price":14.99,"variation_id":"401-A","color":"","format":"Hardcover","size":"","language":"English"},{"category":"Books","item_name":"Fiction Novel","item_id":401,"brand":"-","price":14.99,"variation_id":"401-B","color":"","format":"Paperback","size":"","language":"Spanish"},{"category":"Books","item_name":"Non-Fiction Guide","item_id":402,"brand":"-","price":24.99,"variation_id":"402-A","color":"","format":"eBook","size":"","language":"English"},{"category":"Books","item_name":"Non-Fiction Guide","item_id":402,"brand":"-","price":24.99,"variation_id":"402-B","color":"","format":"Paperback","size":"","language":"French"},{"category":"Sports Equipment","item_name":"Basketball","item_id":501,"brand":"SportsGear","price":29.99,"variation_id":"501-A","color":"Orange","size":"Size 7"},{"category":"Sports Equipment","item_name":"Tennis Racket","item_id":502,"brand":"RacketPro","price":89.99,"variation_id":"502-A","color":"Black","size":"","material":"Graphite"},{"category":"Sports Equipment","item_name":"Tennis Racket","item_id":502,"brand":"RacketPro","price":89.99,"variation_id":"502-B","color":"Silver","size":"","material":"Aluminum"}]
The following code imports the data from file of the data written in JSON. Javascript object notation (JSON) is frequently the format for API returns and is easily written and read by machines but is also relatively easy to read for humans. It also allows for a hierarchical structure which may be useful but can also complicate importing the data. Finally, it is widely supported.
cuny_mart_json <- fromJSON("https://raw.githubusercontent.com/mraynolds/data_607/refs/heads/main/cuny_mart_json_export.json")
head(cuny_mart_json)
## category item_name item_id brand price variation_id
## 1 Electronics Smartphone 101 TechBrand 699.99 101-A
## 2 Electronics Smartphone 101 TechBrand 699.99 101-B
## 3 Electronics Laptop 102 CompuBrand 1099.99 102-A
## 4 Electronics Laptop 102 CompuBrand 1099.99 102-B
## 5 Home Appliances Refrigerator 201 HomeCool 899.99 201-A
## 6 Home Appliances Refrigerator 201 HomeCool 899.99 201-B
## color size storage capactiy type format language material
## 1 Black 64GB <NA> <NA> <NA> <NA> <NA>
## 2 White 128GB <NA> <NA> <NA> <NA> <NA>
## 3 Silver 256GB <NA> <NA> <NA> <NA> <NA>
## 4 Space Gray 512GB <NA> <NA> <NA> <NA> <NA>
## 5 Stainless Steel <NA> 20 cu ft <NA> <NA> <NA> <NA>
## 6 White <NA> 18 cu ft <NA> <NA> <NA> <NA>
A small bit of analysis: The following code counts the colors of the products from the imported json dataset.
cuny_mart_json |> filter(color != "") |> count(color)
## color n
## 1 Black 2
## 2 Blue 1
## 3 Dark Blue 1
## 4 Green 1
## 5 Light Blue 1
## 6 Orange 1
## 7 Red 1
## 8 Silver 2
## 9 Space Gray 1
## 10 Stainless Steel 1
## 11 White 2
library(rvest)
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
library(htmlTable)
The following code exports the data from the prepared dataframe to an html file and displays the raw html file.
cuny_mart_html_export <- htmlTable(cuny_mart)
write(cuny_mart_html_export, "cuny_mart_html_export.html")
cat(cuny_mart_html_export)
## <table class='gmisc_table' style='border-collapse: collapse; margin-top: 1em; margin-bottom: 1em;' >
## <thead>
## <tr><th style='border-bottom: 1px solid grey; border-top: 2px solid grey;'></th>
## <th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>category</th>
## <th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>item_name</th>
## <th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>item_id</th>
## <th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>brand</th>
## <th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>price</th>
## <th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>variation_id</th>
## <th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>color</th>
## <th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>type</th>
## <th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>format</th>
## <th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>size</th>
## <th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>material</th>
## <th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>storage</th>
## <th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>capactiy</th>
## <th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>language</th>
## </tr>
## </thead>
## <tbody>
## <tr>
## <td style='text-align: left;'>1</td>
## <td style='text-align: center;'>Electronics</td>
## <td style='text-align: center;'>Smartphone</td>
## <td style='text-align: center;'>101</td>
## <td style='text-align: center;'>TechBrand</td>
## <td style='text-align: center;'>699.99</td>
## <td style='text-align: center;'>101-A</td>
## <td style='text-align: center;'>Black</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>64GB</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## </tr>
## <tr>
## <td style='text-align: left;'>2</td>
## <td style='text-align: center;'>Electronics</td>
## <td style='text-align: center;'>Smartphone</td>
## <td style='text-align: center;'>101</td>
## <td style='text-align: center;'>TechBrand</td>
## <td style='text-align: center;'>699.99</td>
## <td style='text-align: center;'>101-B</td>
## <td style='text-align: center;'>White</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>128GB</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## </tr>
## <tr>
## <td style='text-align: left;'>3</td>
## <td style='text-align: center;'>Electronics</td>
## <td style='text-align: center;'>Laptop</td>
## <td style='text-align: center;'>102</td>
## <td style='text-align: center;'>CompuBrand</td>
## <td style='text-align: center;'>1099.99</td>
## <td style='text-align: center;'>102-A</td>
## <td style='text-align: center;'>Silver</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>256GB</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## </tr>
## <tr>
## <td style='text-align: left;'>4</td>
## <td style='text-align: center;'>Electronics</td>
## <td style='text-align: center;'>Laptop</td>
## <td style='text-align: center;'>102</td>
## <td style='text-align: center;'>CompuBrand</td>
## <td style='text-align: center;'>1099.99</td>
## <td style='text-align: center;'>102-B</td>
## <td style='text-align: center;'>Space Gray</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>512GB</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## </tr>
## <tr>
## <td style='text-align: left;'>5</td>
## <td style='text-align: center;'>Home Appliances</td>
## <td style='text-align: center;'>Refrigerator</td>
## <td style='text-align: center;'>201</td>
## <td style='text-align: center;'>HomeCool</td>
## <td style='text-align: center;'>899.99</td>
## <td style='text-align: center;'>201-A</td>
## <td style='text-align: center;'>Stainless Steel</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>20 cu ft</td>
## <td style='text-align: center;'></td>
## </tr>
## <tr>
## <td style='text-align: left;'>6</td>
## <td style='text-align: center;'>Home Appliances</td>
## <td style='text-align: center;'>Refrigerator</td>
## <td style='text-align: center;'>201</td>
## <td style='text-align: center;'>HomeCool</td>
## <td style='text-align: center;'>899.99</td>
## <td style='text-align: center;'>201-B</td>
## <td style='text-align: center;'>White</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>18 cu ft</td>
## <td style='text-align: center;'></td>
## </tr>
## <tr>
## <td style='text-align: left;'>7</td>
## <td style='text-align: center;'>Home Appliances</td>
## <td style='text-align: center;'>Washing Machine</td>
## <td style='text-align: center;'>202</td>
## <td style='text-align: center;'>CleanTech</td>
## <td style='text-align: center;'>499.99</td>
## <td style='text-align: center;'>202-A</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>Front Load</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>4.5 cu ft</td>
## <td style='text-align: center;'></td>
## </tr>
## <tr>
## <td style='text-align: left;'>8</td>
## <td style='text-align: center;'>Home Appliances</td>
## <td style='text-align: center;'>Washing Machine</td>
## <td style='text-align: center;'>202</td>
## <td style='text-align: center;'>CleanTech</td>
## <td style='text-align: center;'>499.99</td>
## <td style='text-align: center;'>202-B</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>Top Load</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>5.0 cu ft</td>
## <td style='text-align: center;'></td>
## </tr>
## <tr>
## <td style='text-align: left;'>9</td>
## <td style='text-align: center;'>Clothing</td>
## <td style='text-align: center;'>T-Shirt</td>
## <td style='text-align: center;'>301</td>
## <td style='text-align: center;'>FashionCo</td>
## <td style='text-align: center;'>19.99</td>
## <td style='text-align: center;'>301-A</td>
## <td style='text-align: center;'>Blue</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>S</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## </tr>
## <tr>
## <td style='text-align: left;'>10</td>
## <td style='text-align: center;'>Clothing</td>
## <td style='text-align: center;'>T-Shirt</td>
## <td style='text-align: center;'>301</td>
## <td style='text-align: center;'>FashionCo</td>
## <td style='text-align: center;'>19.99</td>
## <td style='text-align: center;'>301-B</td>
## <td style='text-align: center;'>Red</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>M</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## </tr>
## <tr>
## <td style='text-align: left;'>11</td>
## <td style='text-align: center;'>Clothing</td>
## <td style='text-align: center;'>T-Shirt</td>
## <td style='text-align: center;'>301</td>
## <td style='text-align: center;'>FashionCo</td>
## <td style='text-align: center;'>19.99</td>
## <td style='text-align: center;'>301-C</td>
## <td style='text-align: center;'>Green</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>L</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## </tr>
## <tr>
## <td style='text-align: left;'>12</td>
## <td style='text-align: center;'>Clothing</td>
## <td style='text-align: center;'>Jeans</td>
## <td style='text-align: center;'>302</td>
## <td style='text-align: center;'>DenimWorks</td>
## <td style='text-align: center;'>49.99</td>
## <td style='text-align: center;'>302-A</td>
## <td style='text-align: center;'>Dark Blue</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>32</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## </tr>
## <tr>
## <td style='text-align: left;'>13</td>
## <td style='text-align: center;'>Clothing</td>
## <td style='text-align: center;'>Jeans</td>
## <td style='text-align: center;'>302</td>
## <td style='text-align: center;'>DenimWorks</td>
## <td style='text-align: center;'>49.99</td>
## <td style='text-align: center;'>302-B</td>
## <td style='text-align: center;'>Light Blue</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>34</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## </tr>
## <tr>
## <td style='text-align: left;'>14</td>
## <td style='text-align: center;'>Books</td>
## <td style='text-align: center;'>Fiction Novel</td>
## <td style='text-align: center;'>401</td>
## <td style='text-align: center;'>-</td>
## <td style='text-align: center;'>14.99</td>
## <td style='text-align: center;'>401-A</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>Hardcover</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>English</td>
## </tr>
## <tr>
## <td style='text-align: left;'>15</td>
## <td style='text-align: center;'>Books</td>
## <td style='text-align: center;'>Fiction Novel</td>
## <td style='text-align: center;'>401</td>
## <td style='text-align: center;'>-</td>
## <td style='text-align: center;'>14.99</td>
## <td style='text-align: center;'>401-B</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>Paperback</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>Spanish</td>
## </tr>
## <tr>
## <td style='text-align: left;'>16</td>
## <td style='text-align: center;'>Books</td>
## <td style='text-align: center;'>Non-Fiction Guide</td>
## <td style='text-align: center;'>402</td>
## <td style='text-align: center;'>-</td>
## <td style='text-align: center;'>24.99</td>
## <td style='text-align: center;'>402-A</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>eBook</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>English</td>
## </tr>
## <tr>
## <td style='text-align: left;'>17</td>
## <td style='text-align: center;'>Books</td>
## <td style='text-align: center;'>Non-Fiction Guide</td>
## <td style='text-align: center;'>402</td>
## <td style='text-align: center;'>-</td>
## <td style='text-align: center;'>24.99</td>
## <td style='text-align: center;'>402-B</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>Paperback</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>French</td>
## </tr>
## <tr>
## <td style='text-align: left;'>18</td>
## <td style='text-align: center;'>Sports Equipment</td>
## <td style='text-align: center;'>Basketball</td>
## <td style='text-align: center;'>501</td>
## <td style='text-align: center;'>SportsGear</td>
## <td style='text-align: center;'>29.99</td>
## <td style='text-align: center;'>501-A</td>
## <td style='text-align: center;'>Orange</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>Size 7</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## </tr>
## <tr>
## <td style='text-align: left;'>19</td>
## <td style='text-align: center;'>Sports Equipment</td>
## <td style='text-align: center;'>Tennis Racket</td>
## <td style='text-align: center;'>502</td>
## <td style='text-align: center;'>RacketPro</td>
## <td style='text-align: center;'>89.99</td>
## <td style='text-align: center;'>502-A</td>
## <td style='text-align: center;'>Black</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'>Graphite</td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## <td style='text-align: center;'></td>
## </tr>
## <tr>
## <td style='border-bottom: 2px solid grey; text-align: left;'>20</td>
## <td style='border-bottom: 2px solid grey; text-align: center;'>Sports Equipment</td>
## <td style='border-bottom: 2px solid grey; text-align: center;'>Tennis Racket</td>
## <td style='border-bottom: 2px solid grey; text-align: center;'>502</td>
## <td style='border-bottom: 2px solid grey; text-align: center;'>RacketPro</td>
## <td style='border-bottom: 2px solid grey; text-align: center;'>89.99</td>
## <td style='border-bottom: 2px solid grey; text-align: center;'>502-B</td>
## <td style='border-bottom: 2px solid grey; text-align: center;'>Silver</td>
## <td style='border-bottom: 2px solid grey; text-align: center;'></td>
## <td style='border-bottom: 2px solid grey; text-align: center;'></td>
## <td style='border-bottom: 2px solid grey; text-align: center;'></td>
## <td style='border-bottom: 2px solid grey; text-align: center;'>Aluminum</td>
## <td style='border-bottom: 2px solid grey; text-align: center;'></td>
## <td style='border-bottom: 2px solid grey; text-align: center;'></td>
## <td style='border-bottom: 2px solid grey; text-align: center;'></td>
## </tr>
## </tbody>
## </table>
The following code imports the data from a file of the data written in html. Hypertext Markup Language or HTML is used widely as a markup language, commonly in web pages. It is common and so manipulating data in html is a useful way of accessing data presented on webpages. The html from a website can then be scraped and then transformed. It is also relatively easy for humans to read basic html. However, it is not optimized for storing or communicating large amounts of data.
cuny_mart_html <- read_html("https://raw.githubusercontent.com/mraynolds/data_607/refs/heads/main/cuny_mart_html_export.html", )
cuny_mart_html <- cuny_mart_html |> html_element("table")|> html_table() |> select(category:language)
head(cuny_mart_html)
## # A tibble: 6 × 14
## category item_name item_id brand price variation_id color type format size
## <chr> <chr> <int> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 Electroni… Smartpho… 101 Tech… 700. 101-A Black "" "" ""
## 2 Electroni… Smartpho… 101 Tech… 700. 101-B White "" "" ""
## 3 Electroni… Laptop 102 Comp… 1100. 102-A Silv… "" "" ""
## 4 Electroni… Laptop 102 Comp… 1100. 102-B Spac… "" "" ""
## 5 Home Appl… Refriger… 201 Home… 900. 201-A Stai… "" "" ""
## 6 Home Appl… Refriger… 201 Home… 900. 201-B White "" "" ""
## # ℹ 4 more variables: material <chr>, storage <chr>, capactiy <chr>,
## # language <chr>
A small bit of analysis: The following code counts the number items in each category.
cuny_mart_html |> count(category)
## # A tibble: 5 × 2
## category n
## <chr> <int>
## 1 Books 4
## 2 Clothing 5
## 3 Electronics 4
## 4 Home Appliances 4
## 5 Sports Equipment 3
The following code loads packages for use with xml.
library(XML)
library(xml2)
library(methods)
library(xmlconvert)
The following code imports an xml file and displays the raw XML data. XML is Extensible Markup Language and is useful for carrying and storing data. It allows customization through tags and is hierarchical. However, it tends to have larger file sizes and slower queries so it is not useful for big data.
cuny_mart_xml <- read_xml("https://raw.githubusercontent.com/mraynolds/data_607/refs/heads/main/cuny_mart_xml.xml")
print(cuny_mart_xml)
## {xml_document}
## <root encoding="UTF-8">
## [1] <record>\n <category>Electronics</category>\n <item_name>Smartphone</i ...
## [2] <record>\n <category>Electronics</category>\n <item_name>Smartphone</i ...
## [3] <record>\n <category>Electronics</category>\n <item_name>Laptop</item_ ...
## [4] <record>\n <category>Electronics</category>\n <item_name>Laptop</item_ ...
## [5] <record>\n <category>Home Appliances</category>\n <item_name>Refrigera ...
## [6] <record>\n <category>Home Appliances</category>\n <item_name>Refrigera ...
## [7] <record>\n <category>Home Appliances</category>\n <item_name>Washing M ...
## [8] <record>\n <category>Home Appliances</category>\n <item_name>Washing M ...
## [9] <record>\n <category>Clothing</category>\n <item_name>T-Shirt</item_na ...
## [10] <record>\n <category>Clothing</category>\n <item_name>T-Shirt</item_na ...
## [11] <record>\n <category>Clothing</category>\n <item_name>T-Shirt</item_na ...
## [12] <record>\n <category>Clothing</category>\n <item_name>Jeans</item_name ...
## [13] <record>\n <category>Clothing</category>\n <item_name>Jeans</item_name ...
## [14] <record>\n <category>Books</category>\n <item_name>Fiction Novel</item ...
## [15] <record>\n <category>Books</category>\n <item_name>Fiction Novel</item ...
## [16] <record>\n <category>Books</category>\n <item_name>Non-Fiction Guide</ ...
## [17] <record>\n <category>Books</category>\n <item_name>Non-Fiction Guide</ ...
## [18] <record>\n <category>Sports Equipment</category>\n <item_name>Basketba ...
## [19] <record>\n <category>Sports Equipment</category>\n <item_name>Tennis R ...
## [20] <record>\n <category>Sports Equipment</category>\n <item_name>Tennis R ...
The following code block parses the XML and creates a dataframe from it.
cuny_mart_xml <- xmlParse(cuny_mart_xml)
cuny_mart_xml <- xmlToDataFrame(cuny_mart_xml)
head(cuny_mart_xml)
## category item_name item_id brand price variation_id
## 1 Electronics Smartphone 101 TechBrand 699.99 101-A
## 2 Electronics Smartphone 101 TechBrand 699.99 101-B
## 3 Electronics Laptop 102 CompuBrand 1099.99 102-A
## 4 Electronics Laptop 102 CompuBrand 1099.99 102-B
## 5 Home Appliances Refrigerator 201 HomeCool 899.99 201-A
## 6 Home Appliances Refrigerator 201 HomeCool 899.99 201-B
## color size storage capactiy type format language material
## 1 Black 64GB <NA> <NA> <NA> <NA> <NA>
## 2 White 128GB <NA> <NA> <NA> <NA> <NA>
## 3 Silver 256GB <NA> <NA> <NA> <NA> <NA>
## 4 Space Gray 512GB <NA> <NA> <NA> <NA> <NA>
## 5 Stainless Steel <NA> 20 cu ft <NA> <NA> <NA> <NA>
## 6 White <NA> 18 cu ft <NA> <NA> <NA> <NA>
The following code exports a dataframe to xml.
write_xml(df_to_xml(cuny_mart), "cuny_mart_xml.xml")
A small bit of analysis: The following code shows the dollars per gigabyte for the electronic items.
cuny_mart_xml |> mutate(
storage = str_remove_all(storage, "\\D*"),
dollar_per_gb = (as.numeric(price)/as.numeric(storage))
) |> filter(dollar_per_gb != "NA") |> select(item_name, brand, price,
dollar_per_gb
)
## item_name brand price dollar_per_gb
## 1 Smartphone TechBrand 699.99 10.937344
## 2 Smartphone TechBrand 699.99 5.468672
## 3 Laptop CompuBrand 1099.99 4.296836
## 4 Laptop CompuBrand 1099.99 2.148418