**Data 607 Assignment: Working with JSON, HTML, XML, and Parquet in R**

You have received the following data from CUNYMart, located at 123 Example Street, Anytown, USA.

Category,Item Name,Item ID,Brand,Price,Variation ID,Variation Details
Electronics,Smartphone,101,TechBrand,699.99,101-A,Color: Black, Storage: 64GB
Electronics,Smartphone,101,TechBrand,699.99,101-B,Color: White, Storage: 128GB
Electronics,Laptop,102,CompuBrand,1099.99,102-A,Color: Silver, Storage: 256GB
Electronics,Laptop,102,CompuBrand,1099.99,102-B,Color: Space Gray, Storage: 512GB
Home Appliances,Refrigerator,201,HomeCool,899.99,201-A,Color: Stainless Steel, Capacity: 20 cu ft
Home Appliances,Refrigerator,201,HomeCool,899.99,201-B,Color: White, Capacity: 18 cu ft
Home Appliances,Washing Machine,202,CleanTech,499.99,202-A,Type: Front Load, Capacity: 4.5 cu ft
Home Appliances,Washing Machine,202,CleanTech,499.99,202-B,Type: Top Load, Capacity: 5.0 cu ft
Clothing,T-Shirt,301,FashionCo,19.99,301-A,Color: Blue, Size: S
Clothing,T-Shirt,301,FashionCo,19.99,301-B,Color: Red, Size: M
Clothing,T-Shirt,301,FashionCo,19.99,301-C,Color: Green, Size: L
Clothing,Jeans,302,DenimWorks,49.99,302-A,Color: Dark Blue, Size: 32
Clothing,Jeans,302,DenimWorks,49.99,302-B,Color: Light Blue, Size: 34
Books,Fiction Novel,401,-,14.99,401-A,Format: Hardcover, Language: English
Books,Fiction Novel,401,-,14.99,401-B,Format: Paperback, Language: Spanish
Books,Non-Fiction Guide,402,-,24.99,402-A,Format: eBook, Language: English
Books,Non-Fiction Guide,402,-,24.99,402-B,Format: Paperback, Language: French
Sports Equipment,Basketball,501,SportsGear,29.99,501-A,Size: Size 7, Color: Orange
Sports Equipment,Tennis Racket,502,RacketPro,89.99,502-A,Material: Graphite, Color: Black
Sports Equipment,Tennis Racket,502,RacketPro,89.99,502-B,Material: Aluminum, Color: Silver

This data will be used for inventory analysis at the retailer. You are required to prepare the data for analysis by formatting it in JSON, HTML, XML, and Parquet. Additionally, provide the pros and cons of each format. You must include R code for generating and importing the data into R.
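The Variation Details field contains commas of its own, so a plain comma-separated read would split it into extra columns. The sketch below shows one possible way to load such rows into a data frame with base R by splitting on the first six commas only. The column names, the helper `split_row`, and the object name `inventory` are illustrative assumptions, and only two of the twenty rows are used.

```r
# Two sample rows copied from the assignment text (the full data has 20 rows).
raw_lines <- c(
  "Electronics,Smartphone,101,TechBrand,699.99,101-A,Color: Black, Storage: 64GB",
  "Clothing,T-Shirt,301,FashionCo,19.99,301-A,Color: Blue, Size: S"
)

# Hypothetical column names without spaces, chosen for this sketch.
col_names <- c("Category", "ItemName", "ItemID", "Brand", "Price",
               "VariationID", "VariationDetails")

# Keep the first six fields as they are and glue everything after the
# sixth comma back together, so commas inside the details survive.
split_row <- function(line, n_fields = 7) {
  parts <- strsplit(line, ",", fixed = TRUE)[[1]]
  head_fields <- parts[seq_len(n_fields - 1)]
  tail_field <- paste(parts[n_fields:length(parts)], collapse = ",")
  c(head_fields, tail_field)
}

rows <- lapply(raw_lines, split_row)
inventory <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(inventory) <- col_names

# Convert the numeric-looking columns from text.
inventory$ItemID <- as.integer(inventory$ItemID)
inventory$Price <- as.numeric(inventory$Price)

print(inventory)
```

Keeping the details in a single column matches the seven-column header of the assignment data; splitting it into separate attribute columns would be a later cleaning step.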
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(XML)
library(rvest)
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:readr':
##
## guess_encoding
library(RCurl)
##
## Attaching package: 'RCurl'
##
## The following object is masked from 'package:tidyr':
##
## complete
library(jsonlite)
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:purrr':
##
## flatten
library(httr)
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
##
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
## The following object is masked from 'package:purrr':
##
## compact
library(dplyr)
library(xml2)
# Create an HTML file:
| | Category | Item Name | Item ID | Brand | Price | Variation ID | Color | Variation Details |
---|---|---|---|---|---|---|---|---|
1 | Electronics | Smartphone | 101 | TechBrand | 699.99 | 101-A | Color: Black | Storage: 64GB |
2 | Electronics | Smartphone | 101 | TechBrand | 699.99 | 101-B | Color: White | Storage: 128GB |
3 | Electronics | Laptop | 102 | CompuBrand | 1099.99 | 102-A | Color: Silver | Storage: 256GB |
4 | Electronics | Laptop | 102 | CompuBrand | 1099.99 | 102-B | Color: Space Gray | Storage: 512GB |
5 | Home Appliances | Refrigerator | 201 | HomeCool | 899.99 | 201-A | Color: Stainless Steel | Capacity: 20 cu ft |
6 | Home Appliances | Refrigerator | 201 | HomeCool | 899.99 | 201-B | Color: White | Capacity: 18 cu ft |
7 | Home Appliances | Washing Machine | 202 | CleanTech | 499.99 | 202-A | Type: Front Load | Capacity: 4.5 cu ft |
8 | Home Appliances | Washing Machine | 202 | CleanTech | 499.99 | 202-B | Type: Top Load | Capacity: 5.0 cu ft |
9 | Clothing | T-Shirt | 301 | FashionCo | 19.99 | 301-A | Color: Blue | Size: S |
10 | Clothing | T-Shirt | 301 | FashionCo | 19.99 | 301-B | Color: Red | Size: M |
11 | Clothing | T-Shirt | 301 | FashionCo | 19.99 | 301-C | Color: Green | Size: L |
12 | Clothing | Jeans | 302 | DenimWorks | 49.99 | 302-A | Color: Dark Blue | Size: 32 |
13 | Clothing | Jeans | 302 | DenimWorks | 49.99 | 302-B | Color: Light Blue | Size: 34 |
14 | Books | Fiction Novel | 401 | - | 14.99 | 401-A | Format: Hardcover | Language: English |
15 | Books | Fiction Novel | 401 | - | 14.99 | 401-B | Format: Paperback | Language: Spanish |
16 | Books | Non-Fiction Guide | 402 | - | 24.99 | 402-A | Format: eBook | Language: English |
17 | Books | Non-Fiction Guide | 402 | - | 24.99 | 402-B | Format: Paperback | Language: French |
18 | Sports Equipment | Basketball | 501 | SportsGear | 29.99 | 501-A | Size: Size 7 | Color: Orange |
19 | Sports Equipment | Tennis Racket | 502 | RacketPro | 89.99 | 502-A | Material: Graphite | Color: Black |
20 | Sports Equipment | Tennis Racket | 502 | RacketPro | 89.99 | 502-B | Material: Aluminum | Color: Silver |
url <- getURL('https://raw.githubusercontent.com/asadny82/Data607/refs/heads/main/week7Assignment.html')
data_HTML <- url %>%
  read_html(encoding = 'UTF-8') %>%
  html_table(header = NA, trim = TRUE) %>%
  .[[1]]
data_HTML
## # A tibble: 20 × 9
## `` Category `Item Name` `Item ID` Brand Price `Variation ID` Color
## <int> <chr> <chr> <int> <chr> <dbl> <chr> <chr>
## 1 1 Electronics Smartphone 101 Tech… 700. 101-A Black
## 2 2 Electronics Smartphone 101 Tech… 700. 101-B White
## 3 3 Electronics Laptop 102 Comp… 1100. 102-A Silv…
## 4 4 Electronics Laptop 102 Comp… 1100. 102-B Spac…
## 5 5 Home Appliances Refrigerat… 201 Home… 900. 201-A Stai…
## 6 6 Home Appliances Refrigerat… 201 Home… 900. 201-B White
## 7 7 Home Appliances Washing Ma… 202 Clea… 500. 202-A Fron…
## 8 8 Home Appliances Washing Ma… 202 Clea… 500. 202-B Type…
## 9 9 Clothing T-Shirt 301 Fash… 20.0 301-A Blue
## 10 10 Clothing T-Shirt 301 Fash… 20.0 301-B Red
## 11 11 Clothing T-Shirt 301 Fash… 20.0 301-C Green
## 12 12 Clothing Jeans 302 Deni… 50.0 302-A Dark…
## 13 13 Clothing Jeans 302 Deni… 50.0 302-B Ligh…
## 14 14 Books Fiction No… 401 - 15.0 401-A Hard…
## 15 15 Books Fiction No… 401 - 15.0 401-B Pape…
## 16 16 Books Non-Fictio… 402 - 25.0 402-A eBook
## 17 17 Books Non-Fictio… 402 - 25.0 402-B Form…
## 18 18 Sports Equipme… Basketball 501 Spor… 30.0 501-A Size…
## 19 19 Sports Equipme… Tennis Rac… 502 Rack… 90.0 502-A Black
## 20 20 Sports Equipme… Tennis Rac… 502 Rack… 90.0 502-B Silv…
## # ℹ 1 more variable: `Variation Details` <chr>
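The assignment also asks for code that generates the file, not only code that imports it. As a minimal sketch, assuming the inventory is already in a data frame, an HTML table can be written with base R string functions and read back with `rvest`. The data frame `inventory`, the helper `row_to_html`, and the temporary file are assumptions made for illustration; cell values are not HTML-escaped here, which is only safe because the sample values contain no `<`, `>`, or `&`.

```r
library(rvest)

# Hypothetical two-row sample of the inventory.
inventory <- data.frame(
  ItemName = c("Smartphone", "T-Shirt"),
  ItemID = c(101L, 301L),
  Price = c(699.99, 19.99),
  stringsAsFactors = FALSE
)

# Build the header row from the column names.
header_row <- paste0(
  "<tr>", paste0("<th>", names(inventory), "</th>", collapse = ""), "</tr>"
)

# Build one <tr> per data row.
row_to_html <- function(i) {
  cells <- vapply(inventory, function(col) as.character(col[i]), character(1))
  paste0("<tr>", paste0("<td>", cells, "</td>", collapse = ""), "</tr>")
}
body_rows <- vapply(seq_len(nrow(inventory)), row_to_html, character(1))

html_text <- paste0(
  "<html><body><table>", header_row,
  paste(body_rows, collapse = ""),
  "</table></body></html>"
)

# Write the file, then import it the same way as the hosted copy.
html_path <- tempfile(fileext = ".html")
writeLines(html_text, html_path)
inventory_back <- html_table(read_html(html_path))[[1]]

print(inventory_back)
```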
names(data_HTML) <- c('x', 'category', 'itemName', 'itemID', 'brand',
                      'price', 'variationID', 'color', 'variationDetails')
data_HTML
## # A tibble: 20 × 9
## x category itemName itemID brand price variationID color
## <int> <chr> <chr> <int> <chr> <dbl> <chr> <chr>
## 1 1 Electronics Smartphone 101 Tech… 700. 101-A Black
## 2 2 Electronics Smartphone 101 Tech… 700. 101-B White
## 3 3 Electronics Laptop 102 Comp… 1100. 102-A Silv…
## 4 4 Electronics Laptop 102 Comp… 1100. 102-B Spac…
## 5 5 Home Appliances Refrigerator 201 Home… 900. 201-A Stai…
## 6 6 Home Appliances Refrigerator 201 Home… 900. 201-B White
## 7 7 Home Appliances Washing Machine 202 Clea… 500. 202-A Fron…
## 8 8 Home Appliances Washing Machine 202 Clea… 500. 202-B Type…
## 9 9 Clothing T-Shirt 301 Fash… 20.0 301-A Blue
## 10 10 Clothing T-Shirt 301 Fash… 20.0 301-B Red
## 11 11 Clothing T-Shirt 301 Fash… 20.0 301-C Green
## 12 12 Clothing Jeans 302 Deni… 50.0 302-A Dark…
## 13 13 Clothing Jeans 302 Deni… 50.0 302-B Ligh…
## 14 14 Books Fiction Novel 401 - 15.0 401-A Hard…
## 15 15 Books Fiction Novel 401 - 15.0 401-B Pape…
## 16 16 Books Non-Fiction Gui… 402 - 25.0 402-A eBook
## 17 17 Books Non-Fiction Gui… 402 - 25.0 402-B Form…
## 18 18 Sports Equipment Basketball 501 Spor… 30.0 501-A Size…
## 19 19 Sports Equipment Tennis Racket 502 Rack… 90.0 502-A Black
## 20 20 Sports Equipment Tennis Racket 502 Rack… 90.0 502-B Silv…
## # ℹ 1 more variable: variationDetails <chr>
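# plyr was attached after dplyr and masks mutate(), so the dplyr version is called explicitly.
# Drop incomplete rows, treat empty color cells as missing, and fill them from the row above.
data_HTML <- na.omit(data_HTML) %>%
  dplyr::mutate(color = na_if(color, '')) %>%
  fill(color, .direction = 'down') %>%
  dplyr::mutate(variationDetails = str_replace(variationDetails, 'Size: S', 'Storage: 64GB'),
                variationDetails = str_replace(variationDetails, 'Size: M', 'Storage: 64GB')) %>%
  group_by(itemName)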
data_HTML
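The cleaning step above combines `na_if()` and `fill()`. A small sketch on made-up rows may make the effect easier to see: empty strings are first turned into `NA`, and each `NA` then takes the value from the row above it. The data frame `toy` and its values are hypothetical and are not taken from the hosted file.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows with one empty color cell.
toy <- data.frame(
  itemName = c("Refrigerator", "Refrigerator", "Washing Machine"),
  color = c("White", "", "Silver"),
  stringsAsFactors = FALSE
)

filled <- toy %>%
  dplyr::mutate(color = na_if(color, "")) %>%  # "" becomes NA
  fill(color, .direction = "down")             # NA takes the value above it

print(filled)
```

Filling downward copies from the previous row whatever item it belongs to, so it is worth checking that the borrowed value really applies to the row that receives it.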
JSON:
[
{ "Category": " Electronics", "Item Name": " Smartphone", "Item ID": "101", "Brand": " TechBrand", "Price": "699.99", " Variation ID": " 101-A", " Color": " Color: Black", " Variation Details": " Storage: 64GB" },
{ "Category": " Electronics", "Item Name": "Smartphone", "Item ID": "101", "Brand": "TechBrand", "Price": "699.99", " Variation ID": "101-B", " Color": "Color: White", " Variation Details": " Storage: 128GB" },
{ "Category": " Electronics", "Item Name": "Laptop", "Item ID": "102", "Brand": "CompuBrand", "Price": "1099.99", " Variation ID": "102-A", " Color": "Color: Silver", " Variation Details": " Storage: 256GB" },
{ "Category": " Electronics", "Item Name": "Laptop", "Item ID": "102", "Brand": "CompuBrand", "Price": "1099.99", " Variation ID": "102-B", " Color": "Color: Space Gray", " Variation Details": " Storage: 512GB" },
{ "Category": " Home Appliances", "Item Name": "Refrigerator", "Item ID": "201", "Brand": "HomeCool", "Price": "899.99", " Variation ID": "201-A", " Color": "Color: Stainless Steel", " Variation Details": " Capacity: 20 cu ft" },
{ "Category": " Home Appliances", "Item Name": "Refrigerator", "Item ID": "201", "Brand": "HomeCool", "Price": "899.99", " Variation ID": "201-B", " Color": "Color: White", " Variation Details": " Capacity: 18 cu ft" },
{ "Category": " Home Appliances", "Item Name": "Washing Machine", "Item ID": "202", "Brand": "CleanTech", "Price": "499.99", " Variation ID": "202-A", " Color": "Type: Front Load", " Variation Details": " Capacity: 4.5 cu ft" },
{ "Category": " Home Appliances", "Item Name": "Washing Machine", "Item ID": "202", "Brand": "CleanTech", "Price": "499.99", " Variation ID": "202-B", " Color": "Type: Top Load", " Variation Details": " Capacity: 5.0 cu ft" },
{ "Category": " Clothing", "Item Name": "T-Shirt", "Item ID": "301", "Brand": "FashionCo", "Price": "19.99", " Variation ID": "301-A", " Color": "Color: Blue", " Variation Details": " Size: S" },
{ "Category": " Clothing", "Item Name": "T-Shirt", "Item ID": "301", "Brand": "FashionCo", "Price": "19.99", " Variation ID": "301-B", " Color": "Color: Red", " Variation Details": " Size: M" },
{ "Category": " Clothing", "Item Name": "T-Shirt", "Item ID": "301", "Brand": "FashionCo", "Price": "19.99", " Variation ID": "301-C", " Color": "Color: Green", " Variation Details": " Size: L" },
{ "Category": " Clothing", "Item Name": "Jeans", "Item ID": "302", "Brand": "DenimWorks", "Price": "49.99", " Variation ID": "302-A", " Color": "Color: Dark Blue", " Variation Details": " Size: 32" },
{ "Category": " Clothing", "Item Name": "Jeans", "Item ID": "302", "Brand": "DenimWorks", "Price": "49.99", " Variation ID": "302-B", " Color": "Color: Light Blue", " Variation Details": " Size: 34" },
{ "Category": " Books", "Item Name": "Fiction Novel", "Item ID": "401", "Brand": "-", "Price": "14.99", " Variation ID": "401-A", " Color": "Format: Hardcover", " Variation Details": " Language: English" },
{ "Category": " Books", "Item Name": "Fiction Novel", "Item ID": "401", "Brand": "-", "Price": "14.99", " Variation ID": "401-B", " Color": "Format: Paperback", " Variation Details": " Language: Spanish" },
{ "Category": " Books", "Item Name": "Non-Fiction Guide", "Item ID": "402", "Brand": "-", "Price": "24.99", " Variation ID": "402-A", " Color": "Format: eBook", " Variation Details": " Language: English" },
{ "Category": " Books", "Item Name": "Non-Fiction Guide", "Item ID": "402", "Brand": "-", "Price": "24.99", " Variation ID": "402-B", " Color": "Format: Paperback", " Variation Details": " Language: French" },
{ "Category": " Sports Equipment", "Item Name": "Basketball", "Item ID": "501", "Brand": "SportsGear", "Price": "29.99", " Variation ID": "501-A", " Color": "Size: Size 7", " Variation Details": " Color: Orange" },
{ "Category": " Sports Equipment", "Item Name": "Tennis Racket", "Item ID": "502", "Brand": "RacketPro", "Price": "89.99", " Variation ID": "502-A", " Color": "Material: Graphite", " Variation Details": " Color: Black" },
{ "Category": " Sports Equipment", "Item Name": "Tennis Racket", "Item ID": "502", "Brand": "RacketPro", "Price": "89.99", " Variation ID": "502-B", " Color": "Material: Aluminum", " Variation Details": " Color: Silver" }
]
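A JSON file of this shape does not have to be typed by hand. As a sketch under the assumption that the inventory already sits in a data frame, `jsonlite::write_json()` can produce an array of records and `fromJSON()` can read it back. The data frame `inventory` and the temporary path are illustrative, and the values are written as numbers here rather than as the quoted strings shown above.

```r
library(jsonlite)

# Hypothetical two-row sample of the inventory.
inventory <- data.frame(
  Category = c("Electronics", "Clothing"),
  ItemID = c(101L, 301L),
  Price = c(699.99, 19.99),
  stringsAsFactors = FALSE
)

# Write one JSON object per row; digits = NA keeps full numeric precision.
json_path <- tempfile(fileext = ".json")
write_json(inventory, json_path, pretty = TRUE, digits = NA)

# fromJSON() simplifies an array of flat records back into a data frame.
inventory_back <- fromJSON(json_path)

str(inventory_back)
```

Writing the prices as JSON numbers means they come back as numeric values, so no conversion from character is needed after import.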
dataJson <- read_json("https://raw.githubusercontent.com/asadny82/Data607/refs/heads/main/week7Assignment.json")
dataJson
## [[1]]
## [[1]]$Category
## [1] " Electronics"
##
## [[1]]$`Item Name`
## [1] " Smartphone"
##
## [[1]]$`Item ID`
## [1] "101"
##
## [[1]]$Brand
## [1] " TechBrand"
##
## [[1]]$Price
## [1] "699.99"
##
## [[1]]$` Variation ID`
## [1] " 101-A"
##
## [[1]]$` Color`
## [1] "Black"
##
## [[1]]$` Variation Details`
## [1] " Storage: 64GB"
##
##
## [[2]]
## [[2]]$Category
## [1] " Electronics"
##
## [[2]]$`Item Name`
## [1] "Smartphone"
##
## [[2]]$`Item ID`
## [1] "101"
##
## [[2]]$Brand
## [1] "TechBrand"
##
## [[2]]$Price
## [1] "699.99"
##
## [[2]]$` Variation ID`
## [1] "101-B"
##
## [[2]]$` Color`
## [1] "White"
##
## [[2]]$` Variation Details`
## [1] " Storage: 128GB"
##
##
## [[3]]
## [[3]]$Category
## [1] " Electronics"
##
## [[3]]$`Item Name`
## [1] "Laptop"
##
## [[3]]$`Item ID`
## [1] "102"
##
## [[3]]$Brand
## [1] "CompuBrand"
##
## [[3]]$Price
## [1] "1099.99"
##
## [[3]]$` Variation ID`
## [1] "102-A"
##
## [[3]]$` Color`
## [1] "Silver"
##
## [[3]]$` Variation Details`
## [1] " Storage: 256GB"
##
##
## [[4]]
## [[4]]$Category
## [1] " Electronics"
##
## [[4]]$`Item Name`
## [1] "Laptop"
##
## [[4]]$`Item ID`
## [1] "102"
##
## [[4]]$Brand
## [1] "CompuBrand"
##
## [[4]]$Price
## [1] "1099.99"
##
## [[4]]$` Variation ID`
## [1] "102-B"
##
## [[4]]$` Color`
## [1] "Space Gray"
##
## [[4]]$` Variation Details`
## [1] " Storage: 512GB"
##
##
## [[5]]
## [[5]]$Category
## [1] " Home Appliances"
##
## [[5]]$`Item Name`
## [1] "Refrigerator"
##
## [[5]]$`Item ID`
## [1] "201"
##
## [[5]]$Brand
## [1] "HomeCool"
##
## [[5]]$Price
## [1] "899.99"
##
## [[5]]$` Variation ID`
## [1] "201-A"
##
## [[5]]$` Color`
## [1] "Stainless Steel"
##
## [[5]]$` Variation Details`
## [1] " Capacity: 20 cu ft"
##
##
## [[6]]
## [[6]]$Category
## [1] " Home Appliances"
##
## [[6]]$`Item Name`
## [1] "Refrigerator"
##
## [[6]]$`Item ID`
## [1] "201"
##
## [[6]]$Brand
## [1] "HomeCool"
##
## [[6]]$Price
## [1] "899.99"
##
## [[6]]$` Variation ID`
## [1] "201-B"
##
## [[6]]$` Color`
## [1] "White"
##
## [[6]]$` Variation Details`
## [1] " Capacity: 18 cu ft"
##
##
## [[7]]
## [[7]]$Category
## [1] " Home Appliances"
##
## [[7]]$`Item Name`
## [1] "Washing Machine"
##
## [[7]]$`Item ID`
## [1] "202"
##
## [[7]]$Brand
## [1] "CleanTech"
##
## [[7]]$Price
## [1] "499.99"
##
## [[7]]$` Variation ID`
## [1] "202-A"
##
## [[7]]$` Color`
## [1] "Front Load"
##
## [[7]]$` Variation Details`
## [1] " Capacity: 4.5 cu ft"
##
##
## [[8]]
## [[8]]$Category
## [1] " Home Appliances"
##
## [[8]]$`Item Name`
## [1] "Washing Machine"
##
## [[8]]$`Item ID`
## [1] "202"
##
## [[8]]$Brand
## [1] "CleanTech"
##
## [[8]]$Price
## [1] "499.99"
##
## [[8]]$` Variation ID`
## [1] "202-B"
##
## [[8]]$` Color`
## [1] "Top Load"
##
## [[8]]$` Variation Details`
## [1] " Capacity: 5.0 cu ft"
##
##
## [[9]]
## [[9]]$Category
## [1] " Clothing"
##
## [[9]]$`Item Name`
## [1] "T-Shirt"
##
## [[9]]$`Item ID`
## [1] "301"
##
## [[9]]$Brand
## [1] "FashionCo"
##
## [[9]]$Price
## [1] "19.99"
##
## [[9]]$` Variation ID`
## [1] "301-A"
##
## [[9]]$` Color`
## [1] "Blue"
##
## [[9]]$` Variation Details`
## [1] " Size: S"
##
##
## [[10]]
## [[10]]$Category
## [1] " Clothing"
##
## [[10]]$`Item Name`
## [1] "T-Shirt"
##
## [[10]]$`Item ID`
## [1] "301"
##
## [[10]]$Brand
## [1] "FashionCo"
##
## [[10]]$Price
## [1] "19.99"
##
## [[10]]$` Variation ID`
## [1] "301-B"
##
## [[10]]$` Color`
## [1] "Red"
##
## [[10]]$` Variation Details`
## [1] " Size: M"
##
##
## [[11]]
## [[11]]$Category
## [1] " Clothing"
##
## [[11]]$`Item Name`
## [1] "T-Shirt"
##
## [[11]]$`Item ID`
## [1] "301"
##
## [[11]]$Brand
## [1] "FashionCo"
##
## [[11]]$Price
## [1] "19.99"
##
## [[11]]$` Variation ID`
## [1] "301-C"
##
## [[11]]$` Color`
## [1] "Green"
##
## [[11]]$` Variation Details`
## [1] " Size: L"
##
##
## [[12]]
## [[12]]$Category
## [1] " Clothing"
##
## [[12]]$`Item Name`
## [1] "Jeans"
##
## [[12]]$`Item ID`
## [1] "302"
##
## [[12]]$Brand
## [1] "DenimWorks"
##
## [[12]]$Price
## [1] "49.99"
##
## [[12]]$` Variation ID`
## [1] "302-A"
##
## [[12]]$` Color`
## [1] "Dark Blue"
##
## [[12]]$` Variation Details`
## [1] " Size: 32"
##
##
## [[13]]
## [[13]]$Category
## [1] " Clothing"
##
## [[13]]$`Item Name`
## [1] "Jeans"
##
## [[13]]$`Item ID`
## [1] "302"
##
## [[13]]$Brand
## [1] "DenimWorks"
##
## [[13]]$Price
## [1] "49.99"
##
## [[13]]$` Variation ID`
## [1] "302-B"
##
## [[13]]$` Color`
## [1] "Light Blue"
##
## [[13]]$` Variation Details`
## [1] " Size: 34"
##
##
## [[14]]
## [[14]]$Category
## [1] " Books"
##
## [[14]]$`Item Name`
## [1] "Fiction Novel"
##
## [[14]]$`Item ID`
## [1] "401"
##
## [[14]]$Brand
## [1] "-"
##
## [[14]]$Price
## [1] "14.99"
##
## [[14]]$` Variation ID`
## [1] "401-A"
##
## [[14]]$` Color`
## [1] "Hardcover"
##
## [[14]]$` Variation Details`
## [1] " Language: English"
##
##
## [[15]]
## [[15]]$Category
## [1] " Books"
##
## [[15]]$`Item Name`
## [1] "Fiction Novel"
##
## [[15]]$`Item ID`
## [1] "401"
##
## [[15]]$Brand
## [1] "-"
##
## [[15]]$Price
## [1] "14.99"
##
## [[15]]$` Variation ID`
## [1] "401-B"
##
## [[15]]$` Color`
## [1] "Paperback"
##
## [[15]]$` Variation Details`
## [1] " Language: Spanish"
##
##
## [[16]]
## [[16]]$Category
## [1] " Books"
##
## [[16]]$`Item Name`
## [1] "Non-Fiction Guide"
##
## [[16]]$`Item ID`
## [1] "402"
##
## [[16]]$Brand
## [1] "-"
##
## [[16]]$Price
## [1] "24.99"
##
## [[16]]$` Variation ID`
## [1] "402-A"
##
## [[16]]$` Color`
## [1] "eBook"
##
## [[16]]$` Variation Details`
## [1] " Language: English"
##
##
## [[17]]
## [[17]]$Category
## [1] " Books"
##
## [[17]]$`Item Name`
## [1] "Non-Fiction Guide"
##
## [[17]]$`Item ID`
## [1] "402"
##
## [[17]]$Brand
## [1] "-"
##
## [[17]]$Price
## [1] "24.99"
##
## [[17]]$` Variation ID`
## [1] "402-B"
##
## [[17]]$` Color`
## [1] "Paperback"
##
## [[17]]$` Variation Details`
## [1] " Language: French"
##
##
## [[18]]
## [[18]]$Category
## [1] " Sports Equipment"
##
## [[18]]$`Item Name`
## [1] "Basketball"
##
## [[18]]$`Item ID`
## [1] "501"
##
## [[18]]$Brand
## [1] "SportsGear"
##
## [[18]]$Price
## [1] "29.99"
##
## [[18]]$` Variation ID`
## [1] "501-A"
##
## [[18]]$` Color`
## [1] "Orange"
##
## [[18]]$` Variation Details`
## [1] "Size: Size 7"
##
##
## [[19]]
## [[19]]$Category
## [1] " Sports Equipment"
##
## [[19]]$`Item Name`
## [1] "Tennis Racket"
##
## [[19]]$`Item ID`
## [1] "502"
##
## [[19]]$Brand
## [1] "RacketPro"
##
## [[19]]$Price
## [1] "89.99"
##
## [[19]]$` Variation ID`
## [1] "502-A"
##
## [[19]]$` Color`
## [1] "Black"
##
## [[19]]$` Variation Details`
## [1] "Material: Graphite"
##
##
## [[20]]
## [[20]]$Category
## [1] " Sports Equipment"
##
## [[20]]$`Item Name`
## [1] "Tennis Racket"
##
## [[20]]$`Item ID`
## [1] "502"
##
## [[20]]$Brand
## [1] "RacketPro"
##
## [[20]]$Price
## [1] "89.99"
##
## [[20]]$` Variation ID`
## [1] "502-B"
##
## [[20]]$` Color`
## [1] "Silver "
##
## [[20]]$` Variation Details`
## [1] "Material: Aluminum"
json_dirty <- sapply(dataJson, `[`)
knitr::kable(json_dirty)
Category | Electronics | Electronics | Electronics | Electronics | Home Appliances | Home Appliances | Home Appliances | Home Appliances | Clothing | Clothing | Clothing | Clothing | Clothing | Books | Books | Books | Books | Sports Equipment | Sports Equipment | Sports Equipment |
Item Name | Smartphone | Smartphone | Laptop | Laptop | Refrigerator | Refrigerator | Washing Machine | Washing Machine | T-Shirt | T-Shirt | T-Shirt | Jeans | Jeans | Fiction Novel | Fiction Novel | Non-Fiction Guide | Non-Fiction Guide | Basketball | Tennis Racket | Tennis Racket |
Item ID | 101 | 101 | 102 | 102 | 201 | 201 | 202 | 202 | 301 | 301 | 301 | 302 | 302 | 401 | 401 | 402 | 402 | 501 | 502 | 502 |
Brand | TechBrand | TechBrand | CompuBrand | CompuBrand | HomeCool | HomeCool | CleanTech | CleanTech | FashionCo | FashionCo | FashionCo | DenimWorks | DenimWorks | - | - | - | - | SportsGear | RacketPro | RacketPro |
Price | 699.99 | 699.99 | 1099.99 | 1099.99 | 899.99 | 899.99 | 499.99 | 499.99 | 19.99 | 19.99 | 19.99 | 49.99 | 49.99 | 14.99 | 14.99 | 24.99 | 24.99 | 29.99 | 89.99 | 89.99 |
Variation ID | 101-A | 101-B | 102-A | 102-B | 201-A | 201-B | 202-A | 202-B | 301-A | 301-B | 301-C | 302-A | 302-B | 401-A | 401-B | 402-A | 402-B | 501-A | 502-A | 502-B |
Color | Black | White | Silver | Space Gray | Stainless Steel | White | Front Load | Top Load | Blue | Red | Green | Dark Blue | Light Blue | Hardcover | Paperback | eBook | Paperback | Orange | Black | Silver |
Variation Details | Storage: 64GB | Storage: 128GB | Storage: 256GB | Storage: 512GB | Capacity: 20 cu ft | Capacity: 18 cu ft | Capacity: 4.5 cu ft | Capacity: 5.0 cu ft | Size: S | Size: M | Size: L | Size: 32 | Size: 34 | Language: English | Language: Spanish | Language: English | Language: French | Size: Size 7 | Material: Graphite | Material: Aluminum |
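The table above has one row per field and one column per record because `sapply()` simplifies a list of records into a matrix whose columns are the records. The following base R sketch, using two made-up records, shows that shape and one possible way to get the usual one-row-per-record layout instead. The names `records`, `wide`, and `inventory` are assumptions for this example.

```r
# Two hypothetical records shaped like the output of read_json().
records <- list(
  list(Category = " Electronics", Price = "699.99"),
  list(Category = " Clothing", Price = "19.99")
)

# sapply() puts each record in its own column: fields are rows.
wide <- sapply(records, `[`)
print(dim(wide))

# Converting each record to a one-row data frame and stacking the rows
# gives one row per record and one column per field.
inventory <- do.call(
  rbind,
  lapply(records, function(rec) as.data.frame(rec, stringsAsFactors = FALSE))
)
print(inventory)
```

With field names that contain spaces, `as.data.frame()` would rewrite them into syntactic names, so names may need to be restored or cleaned afterwards.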
url <- getURL("https://raw.githubusercontent.com/asadny82/Data607/refs/heads/main/week7Assignment.json")
dataJson <- url %>%
fromJSON() %>%
as.data.frame()
dataJson
## Category Item Name Item ID Brand Price Variation ID
## 1 Electronics Smartphone 101 TechBrand 699.99 101-A
## 2 Electronics Smartphone 101 TechBrand 699.99 101-B
## 3 Electronics Laptop 102 CompuBrand 1099.99 102-A
## 4 Electronics Laptop 102 CompuBrand 1099.99 102-B
## 5 Home Appliances Refrigerator 201 HomeCool 899.99 201-A
## 6 Home Appliances Refrigerator 201 HomeCool 899.99 201-B
## 7 Home Appliances Washing Machine 202 CleanTech 499.99 202-A
## 8 Home Appliances Washing Machine 202 CleanTech 499.99 202-B
## 9 Clothing T-Shirt 301 FashionCo 19.99 301-A
## 10 Clothing T-Shirt 301 FashionCo 19.99 301-B
## 11 Clothing T-Shirt 301 FashionCo 19.99 301-C
## 12 Clothing Jeans 302 DenimWorks 49.99 302-A
## 13 Clothing Jeans 302 DenimWorks 49.99 302-B
## 14 Books Fiction Novel 401 - 14.99 401-A
## 15 Books Fiction Novel 401 - 14.99 401-B
## 16 Books Non-Fiction Guide 402 - 24.99 402-A
## 17 Books Non-Fiction Guide 402 - 24.99 402-B
## 18 Sports Equipment Basketball 501 SportsGear 29.99 501-A
## 19 Sports Equipment Tennis Racket 502 RacketPro 89.99 502-A
## 20 Sports Equipment Tennis Racket 502 RacketPro 89.99 502-B
## Color Variation Details
## 1 Black Storage: 64GB
## 2 White Storage: 128GB
## 3 Silver Storage: 256GB
## 4 Space Gray Storage: 512GB
## 5 Stainless Steel Capacity: 20 cu ft
## 6 White Capacity: 18 cu ft
## 7 Front Load Capacity: 4.5 cu ft
## 8 Top Load Capacity: 5.0 cu ft
## 9 Blue Size: S
## 10 Red Size: M
## 11 Green Size: L
## 12 Dark Blue Size: 32
## 13 Light Blue Size: 34
## 14 Hardcover Language: English
## 15 Paperback Language: Spanish
## 16 eBook Language: English
## 17 Paperback Language: French
## 18 Orange Size: Size 7
## 19 Black Material: Graphite
## 20 Silver Material: Aluminum
str(dataJson)
## 'data.frame': 20 obs. of 8 variables:
## $ Category : chr " Electronics" " Electronics" " Electronics" " Electronics" ...
## $ Item Name : chr " Smartphone" "Smartphone" "Laptop" "Laptop" ...
## $ Item ID : chr "101" "101" "102" "102" ...
## $ Brand : chr " TechBrand" "TechBrand" "CompuBrand" "CompuBrand" ...
## $ Price : chr "699.99" "699.99" "1099.99" "1099.99" ...
## $ Variation ID : chr " 101-A" "101-B" "102-A" "102-B" ...
## $ Color : chr "Black" "White" "Silver" "Space Gray" ...
## $ Variation Details: chr " Storage: 64GB" " Storage: 128GB" " Storage: 256GB" " Storage: 512GB" ...
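The `str()` output shows that every column is character, that several values carry a leading space, and that some column names do too. A minimal base R sketch of one way to tidy this is given below; it uses a made-up three-column sample rather than the imported object, and the names `inventory_raw` and `inventory_clean` are assumptions.

```r
# Hypothetical sample with the same problems as the imported data.
inventory_raw <- data.frame(
  Category = c(" Electronics", " Clothing"),
  Price = c("699.99", "19.99"),
  ` Variation ID` = c(" 101-A", "301-A"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

inventory_clean <- inventory_raw

# Trim stray whitespace from the column names.
names(inventory_clean) <- trimws(names(inventory_clean))

# Trim stray whitespace from every character column.
is_text <- vapply(inventory_clean, is.character, logical(1))
inventory_clean[is_text] <- lapply(inventory_clean[is_text], trimws)

# Store prices as numbers so they can be summarised.
inventory_clean$Price <- as.numeric(inventory_clean$Price)

str(inventory_clean)
```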
XML:
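Before importing the hosted file, it may help to see how an XML file of this kind could be generated and read back in R. The sketch below uses the `xml2` package instead of `XML`. The element name `Item`, the data frame `inventory`, and the temporary path are illustrative assumptions and do not describe how the hosted file was produced.

```r
library(xml2)

# Hypothetical two-row sample of the inventory, kept as text.
inventory <- data.frame(
  ItemName = c("Smartphone", "T-Shirt"),
  Price = c("699.99", "19.99"),
  stringsAsFactors = FALSE
)

# Build <CUNYMart><Item id="1"><ItemName>..</ItemName>..</Item>..</CUNYMart>.
root <- xml_new_root("CUNYMart")
for (i in seq_len(nrow(inventory))) {
  xml_add_child(root, "Item", id = as.character(i))
  item <- xml_children(root)[[i]]
  for (field in names(inventory)) {
    xml_add_child(item, field, inventory[[field]][i])
  }
}

xml_path <- tempfile(fileext = ".xml")
write_xml(root, xml_path)

# Read the file back and pull each field out with XPath.
doc <- read_xml(xml_path)
items <- xml_find_all(doc, ".//Item")
inventory_back <- data.frame(
  id = xml_attr(items, "id"),
  ItemName = xml_text(xml_find_first(items, "./ItemName")),
  Price = as.numeric(xml_text(xml_find_first(items, "./Price"))),
  stringsAsFactors = FALSE
)

print(inventory_back)
```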
XmlUrl <- getURL('https://raw.githubusercontent.com/asadny82/Data607/refs/heads/main/week7.xml')
data_XML <- XmlUrl %>%
xmlParse() %>%
xmlRoot()
data_XML
## <CUNYMart>
## <Category id="1">
## <Category> Electronics</Category>
## <ItemName> Smartphone</ItemName>
## <ItemID>101</ItemID>
## <Brand> TechBrand</Brand>
## <Price>699.99</Price>
## <VariationID> 101-A</VariationID>
## <Color>Black</Color>
## <VariationDetails> Storage: 64GB</VariationDetails>
## </Category>
## <Category id="2">
## <Category> Electronics</Category>
## <ItemName>Smartphone</ItemName>
## <ItemID>101</ItemID>
## <Brand>TechBrand</Brand>
## <Price>699.99</Price>
## <VariationID>101-B</VariationID>
## <Color> White</Color>
## <VariationDetails> Storage: 128GB</VariationDetails>
## </Category>
## <Category id="3">
## <Category> Electronics</Category>
## <ItemName>Laptop</ItemName>
## <ItemID>102</ItemID>
## <Brand>CompuBrand</Brand>
## <Price>1099.99</Price>
## <VariationID>102-A</VariationID>
## <Color>Silver</Color>
## <VariationDetails> Storage: 256GB</VariationDetails>
## </Category>
## <Category id="4">
## <Category> Electronics</Category>
## <ItemName>Laptop</ItemName>
## <ItemID>102</ItemID>
## <Brand>CompuBrand</Brand>
## <Price>1099.99</Price>
## <VariationID>102-B</VariationID>
## <Color>Space Gray</Color>
## <VariationDetails> Storage: 512GB</VariationDetails>
## </Category>
## <Category id="5">
## <Category> Home Appliances</Category>
## <ItemName>Refrigerator</ItemName>
## <ItemID>201</ItemID>
## <Brand>HomeCool</Brand>
## <Price>899.99</Price>
## <VariationID>201-A</VariationID>
## <Color/>
## <VariationDetails> Capacity: 20 cu ft</VariationDetails>
## </Category>
## <Category id="6">
## <Category> Home Appliances</Category>
## <ItemName>Refrigerator</ItemName>
## <ItemID>201</ItemID>
## <Brand>HomeCool</Brand>
## <Price>899.99</Price>
## <VariationID>201-B</VariationID>
## <Color>White</Color>
## <VariationDetails> Capacity: 18 cu ft</VariationDetails>
## </Category>
## <Category id="7">
## <Category> Home Appliances</Category>
## <ItemName>Washing Machine</ItemName>
## <ItemID>202</ItemID>
## <Brand>CleanTech</Brand>
## <Price>499.99</Price>
## <VariationID>202-A</VariationID>
## <Color/>
## <VariationDetails> Capacity: 4.5 cu ft</VariationDetails>
## </Category>
## <Category id="8">
## <Category> Home Appliances</Category>
## <ItemName>Washing Machine</ItemName>
## <ItemID>202</ItemID>
## <Brand>CleanTech</Brand>
## <Price>499.99</Price>
## <VariationID>202-B</VariationID>
## <Color/>
## <VariationDetails> Capacity: 5.0 cu ft</VariationDetails>
## </Category>
## <Category id="9">
## <Category> Clothing</Category>
## <ItemName>T-Shirt</ItemName>
## <ItemID>301</ItemID>
## <Brand>FashionCo</Brand>
## <Price>19.99</Price>
## <VariationID>301-A</VariationID>
## <Color>Blue</Color>
## <VariationDetails> Size: S</VariationDetails>
## </Category>
## <Category id="10">
## <Category> Clothing</Category>
## <ItemName>T-Shirt</ItemName>
## <ItemID>301</ItemID>
## <Brand>FashionCo</Brand>
## <Price>19.99</Price>
## <VariationID>301-B</VariationID>
## <Color>Red</Color>
## <VariationDetails> Size: M</VariationDetails>
## </Category>
## <Category id="11">
## <Category> Clothing</Category>
## <ItemName>T-Shirt</ItemName>
## <ItemID>301</ItemID>
## <Brand>FashionCo</Brand>
## <Price>19.99</Price>
## <VariationID>301-C</VariationID>
## <Color>Green</Color>
## <VariationDetails> Size: L</VariationDetails>
## </Category>
## <Category id="12">
## <Category> Clothing</Category>
## <ItemName>Jeans</ItemName>
## <ItemID>302</ItemID>
## <Brand>DenimWorks</Brand>
## <Price>49.99</Price>
## <VariationID>302-A</VariationID>
## <Color>Dark Blue</Color>
## <VariationDetails> Size: 32</VariationDetails>
## </Category>
## <Category id="13">
## <Category> Clothing</Category>
## <ItemName>Jeans</ItemName>
## <ItemID>302</ItemID>
## <Brand>DenimWorks</Brand>
## <Price>49.99</Price>
## <VariationID>302-B</VariationID>
## <Color>Light Blue</Color>
## <VariationDetails> Size: 34</VariationDetails>
## </Category>
## <Category id="14">
## <Category> Books</Category>
## <ItemName>Fiction Novel</ItemName>
## <ItemID>401</ItemID>
## <Brand>-</Brand>
## <Price>14.99</Price>
## <VariationID>401-A</VariationID>
## <Color/>
## <VariationDetails> Language: English</VariationDetails>
## </Category>
## <Category id="15">
## <Category> Books</Category>
## <ItemName>Fiction Novel</ItemName>
## <ItemID>401</ItemID>
## <Brand>-</Brand>
## <Price>14.99</Price>
## <VariationID>401-B</VariationID>
## <Color/>
## <VariationDetails> Language: Spanish</VariationDetails>
## </Category>
## <Category id="16">
## <Category> Books</Category>
## <ItemName>Non-Fiction Guide</ItemName>
## <ItemID>402</ItemID>
## <Brand>-</Brand>
## <Price>24.99</Price>
## <VariationID>402-A</VariationID>
## <Color/>
## <VariationDetails> Language: English</VariationDetails>
## </Category>
## <Category id="17">
## <Category> Books</Category>
## <ItemName>Non-Fiction Guide</ItemName>
## <ItemID>402</ItemID>
## <Brand>-</Brand>
## <Price>24.99</Price>
## <VariationID>402-B</VariationID>
## <Color/>
## <VariationDetails> Language: French</VariationDetails>
## </Category>
## <Category id="18">
## <Category> Sports Equipment</Category>
## <ItemName>Basketball</ItemName>
## <ItemID>501</ItemID>
## <Brand>SportsGear</Brand>
## <Price>29.99</Price>
## <VariationID>501-A</VariationID>
## <Color>Orange</Color>
## <VariationDetails> Size 7</VariationDetails>
## </Category>
## <Category id="19">
## <Category> Sports Equipment</Category>
## <ItemName>Tennis Racket</ItemName>
## <ItemID>502</ItemID>
## <Brand>RacketPro</Brand>
## <Price>89.99</Price>
## <VariationID>502-A</VariationID>
## <Color>Black</Color>
## <VariationDetails>Material: Graphite</VariationDetails>
## </Category>
## <Category id="20">
## <Category> Sports Equipment</Category>
## <ItemName>Tennis Racket</ItemName>
## <ItemID>502</ItemID>
## <Brand>RacketPro</Brand>
## <Price>89.99</Price>
## <VariationID>502-B</VariationID>
## <Color>Silver</Color>
## <VariationDetails>Aluminum</VariationDetails>
## </Category>
## </CUNYMart>
XmlUrl <- getURL('https://raw.githubusercontent.com/asadny82/Data607/refs/heads/main/week7.xml')
data_XML <- XmlUrl %>%
  xmlParse() %>%
  xmlRoot() %>%
  xmlToDataFrame(stringsAsFactors = FALSE)
data_XML
## Category ItemName ItemID Brand Price VariationID
## 1 Electronics Smartphone 101 TechBrand 699.99 101-A
## 2 Electronics Smartphone 101 TechBrand 699.99 101-B
## 3 Electronics Laptop 102 CompuBrand 1099.99 102-A
## 4 Electronics Laptop 102 CompuBrand 1099.99 102-B
## 5 Home Appliances Refrigerator 201 HomeCool 899.99 201-A
## 6 Home Appliances Refrigerator 201 HomeCool 899.99 201-B
## 7 Home Appliances Washing Machine 202 CleanTech 499.99 202-A
## 8 Home Appliances Washing Machine 202 CleanTech 499.99 202-B
## 9 Clothing T-Shirt 301 FashionCo 19.99 301-A
## 10 Clothing T-Shirt 301 FashionCo 19.99 301-B
## 11 Clothing T-Shirt 301 FashionCo 19.99 301-C
## 12 Clothing Jeans 302 DenimWorks 49.99 302-A
## 13 Clothing Jeans 302 DenimWorks 49.99 302-B
## 14 Books Fiction Novel 401 - 14.99 401-A
## 15 Books Fiction Novel 401 - 14.99 401-B
## 16 Books Non-Fiction Guide 402 - 24.99 402-A
## 17 Books Non-Fiction Guide 402 - 24.99 402-B
## 18 Sports Equipment Basketball 501 SportsGear 29.99 501-A
## 19 Sports Equipment Tennis Racket 502 RacketPro 89.99 502-A
## 20 Sports Equipment Tennis Racket 502 RacketPro 89.99 502-B
## Color VariationDetails
## 1 Black Storage: 64GB
## 2 White Storage: 128GB
## 3 Silver Storage: 256GB
## 4 Space Gray Storage: 512GB
## 5 Capacity: 20 cu ft
## 6 White Capacity: 18 cu ft
## 7 Capacity: 4.5 cu ft
## 8 Capacity: 5.0 cu ft
## 9 Blue Size: S
## 10 Red Size: M
## 11 Green Size: L
## 12 Dark Blue Size: 32
## 13 Light Blue Size: 34
## 14 Language: English
## 15 Language: Spanish
## 16 Language: English
## 17 Language: French
## 18 Orange Size 7
## 19 Black Material: Graphite
## 20 Silver Aluminum
names(data_XML) <- c('catagory', 'itemName', 'itemID', 'brand',
                     'price', 'variationID', 'color', 'variationDetails')
data_HTML
## # A tibble: 20 × 9
## # Groups: itemName [10]
## x catagory itemName itemID brand price variationID color
## <int> <chr> <chr> <int> <chr> <dbl> <chr> <chr>
## 1 1 Electronics Smartphone 101 Tech… 700. 101-A Black
## 2 2 Electronics Smartphone 101 Tech… 700. 101-B White
## 3 3 Electronics Laptop 102 Comp… 1100. 102-A Silv…
## 4 4 Electronics Laptop 102 Comp… 1100. 102-B Spac…
## 5 5 Home Appliances Refrigerator 201 Home… 900. 201-A Stai…
## 6 6 Home Appliances Refrigerator 201 Home… 900. 201-B White
## 7 7 Home Appliances Washing Machine 202 Clea… 500. 202-A Fron…
## 8 8 Home Appliances Washing Machine 202 Clea… 500. 202-B Type…
## 9 9 Clothing T-Shirt 301 Fash… 20.0 301-A Blue
## 10 10 Clothing T-Shirt 301 Fash… 20.0 301-B Red
## 11 11 Clothing T-Shirt 301 Fash… 20.0 301-C Green
## 12 12 Clothing Jeans 302 Deni… 50.0 302-A Dark…
## 13 13 Clothing Jeans 302 Deni… 50.0 302-B Ligh…
## 14 14 Books Fiction Novel 401 - 15.0 401-A Hard…
## 15 15 Books Fiction Novel 401 - 15.0 401-B Pape…
## 16 16 Books Non-Fiction Gui… 402 - 25.0 402-A eBook
## 17 17 Books Non-Fiction Gui… 402 - 25.0 402-B Form…
## 18 18 Sports Equipment Basketball 501 Spor… 30.0 501-A Size…
## 19 19 Sports Equipment Tennis Racket 502 Rack… 90.0 502-A Black
## 20 20 Sports Equipment Tennis Racket 502 Rack… 90.0 502-B Silv…
## # ℹ 1 more variable: variationDetails <chr>
data <- na.omit(data_XML) %>%
  mutate(color = na_if(color, '')) %>%
  fill(color, .direction = 'down') %>%
  mutate(variationDetails = str_replace(variationDetails, 'Size: S', 'Storage: 64GB'),
         variationDetails = str_replace(variationDetails, 'Size: M', 'Storage: 64GB')) %>%
  group_by(itemName)
data
## # A tibble: 20 × 8
## # Groups: itemName [11]
## catagory itemName itemID brand price variationID color variationDetails
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 " Electronics" " Smart… 101 " Te… 699.… " 101-A" "Bla… " Storage: 64G…
## 2 " Electronics" "Smartp… 101 "Tec… 699.… "101-B" " Wh… " Storage: 128G…
## 3 " Electronics" "Laptop" 102 "Com… 1099… "102-A" "Sil… " Storage: 256G…
## 4 " Electronics" "Laptop" 102 "Com… 1099… "102-B" "Spa… " Storage: 512G…
## 5 " Home Applia… "Refrig… 201 "Hom… 899.… "201-A" "Spa… " Capacity: 20 …
## 6 " Home Applia… "Refrig… 201 "Hom… 899.… "201-B" "Whi… " Capacity: 18 …
## 7 " Home Applia… "Washin… 202 "Cle… 499.… "202-A" "Whi… " Capacity: 4.5…
## 8 " Home Applia… "Washin… 202 "Cle… 499.… "202-B" "Whi… " Capacity: 5.0…
## 9 " Clothing" "T-Shir… 301 "Fas… 19.99 "301-A" "Blu… " Storage: 64GB"
## 10 " Clothing" "T-Shir… 301 "Fas… 19.99 "301-B" "Red" " Size: M"
## 11 " Clothing" "T-Shir… 301 "Fas… 19.99 "301-C" "Gre… " Size: L"
## 12 " Clothing" "Jeans" 302 "Den… 49.99 "302-A" "Dar… " Size: 32"
## 13 " Clothing" "Jeans" 302 "Den… 49.99 "302-B" "Lig… " Size: 34"
## 14 " Books" "Fictio… 401 "-" 14.99 "401-A" "Lig… " Language: Eng…
## 15 " Books" "Fictio… 401 "-" 14.99 "401-B" "Lig… " Language: Spa…
## 16 " Books" "Non-Fi… 402 "-" 24.99 "402-A" "Lig… " Language: Eng…
## 17 " Books" "Non-Fi… 402 "-" 24.99 "402-B" "Lig… " Language: Fre…
## 18 " Sports Equi… "Basket… 501 "Spo… 29.99 "501-A" "Ora… " Size 7"
## 19 " Sports Equi… "Tennis… 502 "Rac… 89.99 "502-A" "Bla… "Material: Grap…
## 20 " Sports Equi… "Tennis… 502 "Rac… 89.99 "502-B" "Sil… "Aluminum"
data <- data %>%
  gather('Item', 'product', 2:3) %>%
  spread(Item, product)
data
## # A tibble: 20 × 8
## catagory brand price variationID color variationDetails itemID itemName
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 " Books" "-" 14.99 "401-A" "Lig… " Language: Eng… 401 "Fictio…
## 2 " Books" "-" 14.99 "401-B" "Lig… " Language: Spa… 401 "Fictio…
## 3 " Books" "-" 24.99 "402-A" "Lig… " Language: Eng… 402 "Non-Fi…
## 4 " Books" "-" 24.99 "402-B" "Lig… " Language: Fre… 402 "Non-Fi…
## 5 " Clothing" "Den… 49.99 "302-A" "Dar… " Size: 32" 302 "Jeans"
## 6 " Clothing" "Den… 49.99 "302-B" "Lig… " Size: 34" 302 "Jeans"
## 7 " Clothing" "Fas… 19.99 "301-A" "Blu… " Storage: 64GB" 301 "T-Shir…
## 8 " Clothing" "Fas… 19.99 "301-B" "Red" " Size: M" 301 "T-Shir…
## 9 " Clothing" "Fas… 19.99 "301-C" "Gre… " Size: L" 301 "T-Shir…
## 10 " Electronics" " Te… 699.… " 101-A" "Bla… " Storage: 64G… 101 " Smart…
## 11 " Electronics" "Com… 1099… "102-A" "Sil… " Storage: 256G… 102 "Laptop"
## 12 " Electronics" "Com… 1099… "102-B" "Spa… " Storage: 512G… 102 "Laptop"
## 13 " Electronics" "Tec… 699.… "101-B" " Wh… " Storage: 128G… 101 "Smartp…
## 14 " Home Applia… "Cle… 499.… "202-A" "Whi… " Capacity: 4.5… 202 "Washin…
## 15 " Home Applia… "Cle… 499.… "202-B" "Whi… " Capacity: 5.0… 202 "Washin…
## 16 " Home Applia… "Hom… 899.… "201-A" "Spa… " Capacity: 20 … 201 "Refrig…
## 17 " Home Applia… "Hom… 899.… "201-B" "Whi… " Capacity: 18 … 201 "Refrig…
## 18 " Sports Equi… "Rac… 89.99 "502-A" "Bla… "Material: Grap… 502 "Tennis…
## 19 " Sports Equi… "Rac… 89.99 "502-B" "Sil… "Aluminum" 502 "Tennis…
## 20 " Sports Equi… "Spo… 29.99 "501-A" "Ora… " Size 7" 501 "Basket…
The pros and cons of each format:
All three files can be loaded from the remote source, but each is loaded in a slightly different way and needs some manual effort to become a data frame. RCurl was used to fetch the HTML and XML files, while the JSON file was imported directly with the JSON function. The HTML data needed to be converted to numbers, reshaped from wide to long format, and unnested from there. The XML file was imported as an XML object, so the data had to be extracted using xmlParse, xmlRoot, and xmlToDataFrame. The three resulting data frames are almost identical. The main difference is in how numeric values are parsed from the source files: the html_table function from the rvest package automatically parses numbers as numeric values, whereas xmlToDataFrame returns every column as character (see the <chr> column types in the XML output above), so numeric columns such as price must be converted explicitly.
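
To illustrate the point about numeric parsing, here is a minimal sketch of how the character columns from the XML import could be cleaned up. It does not download the real file. Instead it assumes a small hand-made data frame, `xml_like`, with a few values copied from the listing above, in which every column is character and some values carry a leading space. The names `xml_like` and `cleaned` are made up for this example, and only base R functions are used.

```r
# Hypothetical stand-in for the result of xmlToDataFrame():
# all columns are character and some values have a leading space.
xml_like <- data.frame(
  catagory = c(" Electronics", " Electronics", " Books"),
  itemID   = c("101", "102", "401"),
  price    = c("699.99", "1099.99", "14.99"),
  color    = c("Black", " White", ""),
  stringsAsFactors = FALSE
)

# Trim surrounding whitespace in every column.
cleaned <- as.data.frame(lapply(xml_like, trimws), stringsAsFactors = FALSE)

# Mark empty strings as missing instead of borrowing a value from another row.
cleaned[cleaned == ""] <- NA

# Convert the columns that hold numbers to numeric types.
cleaned$itemID <- as.integer(cleaned$itemID)
cleaned$price  <- as.numeric(cleaned$price)

# Inspect the resulting column types.
str(cleaned)
```

Trimming before grouping matters here: in the grouped output above, `" Smartphone"` and `"Smartphone"` are counted as different items, which is why 11 groups are reported for 10 products. Leaving empty colors as `NA` is one possible design choice; whether that or filling downward is appropriate depends on what the analysis needs.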