Using the pdftools library, I imported the text directly from the assignment PDF, then used read_csv to read the comma-separated text into a data frame and dropped the unnecessary rows.
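library(pdftools)
library(readr)  # provides read_csv()
# pdf_text() returns the PDF text as a character vector, one string per page
raw_text <- pdf_text("File_Formats_Assignments.pdf")
# Skip the first four lines of text, then drop rows 24 to 27, which are not needed
table_text <- read_csv(file = raw_text, skip = 4, show_col_types = FALSE)
table_text <- table_text[-c(24:27),]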
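As a small illustration of the import step, the sketch below feeds literal text to read_csv instead of a file path. The `page_text` string is a made-up stand-in for a page of pdf_text() output, and the `skip` value is chosen for this toy input only. Wrapping the string in `I()` is how readr 2.0 and later mark input as literal data rather than a path.

```r
library(readr)

# Hypothetical stand-in for one page of pdf_text() output:
# two non-table lines followed by comma-separated data.
page_text <- paste(
  "Assignment title",
  "Some instructions",
  "Category,Item Name,Brand",
  "Books,Fiction Novel,-",
  "Clothing,Jeans,DenimWorks",
  sep = "\n"
)

# I() marks the string as literal data rather than a file path;
# skip drops the two non-table lines at the top.
toy <- read_csv(I(page_text), skip = 2, show_col_types = FALSE)
print(toy)
```

The dash in the Brand column is still read as an ordinary string at this point, which is why a separate NA conversion is needed.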
Rows 6, 9 and 11 each held only a fragment that had been split off from the last column of the preceding row, so I appended each fragment to that row before deleting the fragment rows. Then I converted the dashes in the Brand column to NA values.
table_text[5, 7] <- paste(table_text[5, 7], table_text[6, 1], sep=" ")
table_text[8, 7] <- paste(table_text[8, 7], table_text[9, 1], sep=" ")
table_text[10, 7] <- paste(table_text[10, 7], table_text[11, 1], sep=" ")
table_text <- table_text[-c(6, 9, 11), ]
table_text$Brand[table_text$Brand == "-"] <- NA
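The same repair can be seen on a smaller scale. This is a minimal sketch on a made-up three-row data frame (`toy` and its column names are hypothetical), where row 2 holds only a fragment that belongs at the end of row 1.

```r
# Hypothetical three-row table: row 2 holds a fragment that was split
# off from the last column of row 1.
toy <- data.frame(
  Item    = c("Refrigerator", "Capacity: 20 cu ft", "Jeans"),
  Brand   = c("HomeCool", NA, "-"),
  Details = c("Color: Stainless Steel,", NA, "Color: Dark Blue, Size: 32"),
  stringsAsFactors = FALSE
)

# Append the fragment to the previous row's last column,
# then drop the fragment row.
toy$Details[1] <- paste(toy$Details[1], toy$Item[2])
toy <- toy[-2, ]

# Replace the dash placeholder with a real missing value.
toy$Brand[toy$Brand == "-"] <- NA
print(toy)
```

The order matters: the fragment has to be pasted onto the previous row before the fragment row is removed, otherwise the row indices shift.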
Below, I used the following libraries to convert the data frame into the respective formats: rjson, htmlTable, xml2 and arrow.
JSON: As a front-end software engineer, I have had the most experience with this format. It’s fairly easy to read and very popular, with plenty of tools available to parse and convert it. Most web APIs return data in this format, so it can be used across many languages and libraries. But because JSON is verbose plain text with no enforced schema or column types, it’s pretty inefficient for large datasets.
HTML: This is the basic structure of nearly every web page. It’s meant for displaying information, not for transmitting data, but it’s unavoidable when web scraping.
XML: Like HTML, XML uses tags, which makes it verbose and means the values arrive as strings that may need to be parsed. But since these tags can be customized, the data can carry more structure and definition.
PARQUET: Parquet is a columnar format designed for efficiency and scalability with very large datasets, and it can be used across platforms. However, it is a binary format, so once converted the data is no longer readable by humans.
library(rjson)
json_table <- toJSON(table_text)
print(json_table)
## [1] "{\"Category\":[\"Electronics\",\"Electronics\",\"Electronics\",\"Electronics\",\"Home Appliances\",\"Home Appliances\",\"Home Appliances\",\"Home Appliances\",\"Clothing\",\"Clothing\",\"Clothing\",\"Clothing\",\"Clothing\",\"Books\",\"Books\",\"Books\",\"Books\",\"Sports Equipment\",\"Sports Equipment\",\"Sports Equipment\"],\"Item Name\":[\"Smartphone\",\"Smartphone\",\"Laptop\",\"Laptop\",\"Refrigerator\",\"Refrigerator\",\"Washing Machine\",\"Washing Machine\",\"T-Shirt\",\"T-Shirt\",\"T-Shirt\",\"Jeans\",\"Jeans\",\"Fiction Novel\",\"Fiction Novel\",\"Non-Fiction Guide\",\"Non-Fiction Guide\",\"Basketball\",\"Tennis Racket\",\"Tennis Racket\"],\"Item ID\":[\"101\",\"101\",\"102\",\"102\",\"201\",\"201\",\"202\",\"202\",\"301\",\"301\",\"301\",\"302\",\"302\",\"401\",\"401\",\"402\",\"402\",\"501\",\"502\",\"502\"],\"Brand\":[\"TechBrand\",\"TechBrand\",\"CompuBrand\",\"CompuBrand\",\"HomeCool\",\"HomeCool\",\"CleanTech\",\"CleanTech\",\"FashionCo\",\"FashionCo\",\"FashionCo\",\"DenimWorks\",\"DenimWorks\",\"NA\",\"NA\",\"NA\",\"NA\",\"SportsGear\",\"RacketPro\",\"RacketPro\"],\"Price\":[\"699.99\",\"699.99\",\"1099.99\",\"1099.99\",\"899.99\",\"899.99\",\"499.99\",\"499.99\",\"19.99\",\"19.99\",\"19.99\",\"49.99\",\"49.99\",\"14.99\",\"14.99\",\"24.99\",\"24.99\",\"29.99\",\"89.99\",\"89.99\"],\"Variation ID\":[\"101-A\",\"101-B\",\"102-A\",\"102-B\",\"201-A\",\"201-B\",\"202-A\",\"202-B\",\"301-A\",\"301-B\",\"301-C\",\"302-A\",\"302-B\",\"401-A\",\"401-B\",\"402-A\",\"402-B\",\"501-A\",\"502-A\",\"502-B\"],\"Variation Details\":[\"Color: Black, Storage: 64GB\",\"Color: White, Storage: 128GB\",\"Color: Silver, Storage: 256GB\",\"Color: Space Gray, Storage: 512GB\",\"Color: Stainless Steel, Capacity: 20 cu ft\",\"Color: White, Capacity: 18 cu ft\",\"Type: Front Load, Capacity: 4.5 cu ft\",\"Type: Top Load, Capacity: 5.0 cu ft\",\"Color: Blue, Size: S\",\"Color: Red, Size: M\",\"Color: Green, Size: L\",\"Color: Dark Blue, Size: 32\",\"Color: Light Blue, Size: 34\",\"Format: Hardcover, Language: English\",\"Format: Paperback, Language: Spanish\",\"Format: eBook, Language: English\",\"Format: Paperback, Language: French\",\"Size: Size 7, Color: Orange\",\"Material: Graphite, Color: Black\",\"Material: Aluminum, Color: Silver\"]}"
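In the output above, the missing Brand values come out as the string "NA" and the data is laid out one array per column. For comparison, here is a minimal sketch using jsonlite, a different package from the rjson used in this report, which I am assuming is installed. The `toy` data frame is made up for illustration.

```r
library(jsonlite)

# Hypothetical two-row table with one missing Brand.
toy <- data.frame(
  Category = c("Books", "Clothing"),
  Brand    = c(NA, "DenimWorks"),
  Price    = c(14.99, 49.99),
  stringsAsFactors = FALSE
)

# dataframe = "rows" emits one JSON object per record;
# na = "null" writes missing values as JSON null instead of a string.
json_rows <- toJSON(toy, dataframe = "rows", na = "null")
cat(json_rows, "\n")

# Reading the text back gives a data frame with the NA restored.
round_trip <- fromJSON(json_rows)
print(round_trip)
```

Writing a JSON null keeps a missing value distinguishable from a brand that is literally named "NA" when the data is read back.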
library(htmlTable)
html_table <- htmlTable(table_text)
print(head(html_table))
## [1] "<table class='gmisc_table' style='border-collapse: collapse; margin-top: 1em; margin-bottom: 1em;' >\n<thead>\n<tr><th style='border-bottom: 1px solid grey; border-top: 2px solid grey;'></th>\n<th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Category</th>\n<th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Item Name</th>\n<th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Item ID</th>\n<th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Brand</th>\n<th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Price</th>\n<th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Variation ID</th>\n<th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Variation Details</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td style='text-align: left;'>1</td>\n<td style='text-align: center;'>Electronics</td>\n<td style='text-align: center;'>Smartphone</td>\n<td style='text-align: center;'>101</td>\n<td style='text-align: center;'>TechBrand</td>\n<td style='text-align: center;'>699.99</td>\n<td style='text-align: center;'>101-A</td>\n<td style='text-align: center;'>Color: Black, Storage: 64GB</td>\n</tr>\n<tr>\n<td style='text-align: left;'>2</td>\n<td style='text-align: center;'>Electronics</td>\n<td style='text-align: center;'>Smartphone</td>\n<td style='text-align: center;'>101</td>\n<td style='text-align: center;'>TechBrand</td>\n<td style='text-align: center;'>699.99</td>\n<td style='text-align: center;'>101-B</td>\n<td style='text-align: center;'>Color: White, Storage: 128GB</td>\n</tr>\n<tr>\n<td style='text-align: left;'>3</td>\n<td style='text-align: center;'>Electronics</td>\n<td 
style='text-align: center;'>Laptop</td>\n<td style='text-align: center;'>102</td>\n<td style='text-align: center;'>CompuBrand</td>\n<td style='text-align: center;'>1099.99</td>\n<td style='text-align: center;'>102-A</td>\n<td style='text-align: center;'>Color: Silver, Storage: 256GB</td>\n</tr>\n<tr>\n<td style='text-align: left;'>4</td>\n<td style='text-align: center;'>Electronics</td>\n<td style='text-align: center;'>Laptop</td>\n<td style='text-align: center;'>102</td>\n<td style='text-align: center;'>CompuBrand</td>\n<td style='text-align: center;'>1099.99</td>\n<td style='text-align: center;'>102-B</td>\n<td style='text-align: center;'>Color: Space Gray, Storage: 512GB</td>\n</tr>\n<tr>\n<td style='text-align: left;'>5</td>\n<td style='text-align: center;'>Home Appliances</td>\n<td style='text-align: center;'>Refrigerator</td>\n<td style='text-align: center;'>201</td>\n<td style='text-align: center;'>HomeCool</td>\n<td style='text-align: center;'>899.99</td>\n<td style='text-align: center;'>201-A</td>\n<td style='text-align: center;'>Color: Stainless Steel, Capacity: 20 cu ft</td>\n</tr>\n<tr>\n<td style='text-align: left;'>6</td>\n<td style='text-align: center;'>Home Appliances</td>\n<td style='text-align: center;'>Refrigerator</td>\n<td style='text-align: center;'>201</td>\n<td style='text-align: center;'>HomeCool</td>\n<td style='text-align: center;'>899.99</td>\n<td style='text-align: center;'>201-B</td>\n<td style='text-align: center;'>Color: White, Capacity: 18 cu ft</td>\n</tr>\n<tr>\n<td style='text-align: left;'>7</td>\n<td style='text-align: center;'>Home Appliances</td>\n<td style='text-align: center;'>Washing Machine</td>\n<td style='text-align: center;'>202</td>\n<td style='text-align: center;'>CleanTech</td>\n<td style='text-align: center;'>499.99</td>\n<td style='text-align: center;'>202-A</td>\n<td style='text-align: center;'>Type: Front Load, Capacity: 4.5 cu ft</td>\n</tr>\n<tr>\n<td style='text-align: left;'>8</td>\n<td style='text-align: 
center;'>Home Appliances</td>\n<td style='text-align: center;'>Washing Machine</td>\n<td style='text-align: center;'>202</td>\n<td style='text-align: center;'>CleanTech</td>\n<td style='text-align: center;'>499.99</td>\n<td style='text-align: center;'>202-B</td>\n<td style='text-align: center;'>Type: Top Load, Capacity: 5.0 cu ft</td>\n</tr>\n<tr>\n<td style='text-align: left;'>9</td>\n<td style='text-align: center;'>Clothing</td>\n<td style='text-align: center;'>T-Shirt</td>\n<td style='text-align: center;'>301</td>\n<td style='text-align: center;'>FashionCo</td>\n<td style='text-align: center;'>19.99</td>\n<td style='text-align: center;'>301-A</td>\n<td style='text-align: center;'>Color: Blue, Size: S</td>\n</tr>\n<tr>\n<td style='text-align: left;'>10</td>\n<td style='text-align: center;'>Clothing</td>\n<td style='text-align: center;'>T-Shirt</td>\n<td style='text-align: center;'>301</td>\n<td style='text-align: center;'>FashionCo</td>\n<td style='text-align: center;'>19.99</td>\n<td style='text-align: center;'>301-B</td>\n<td style='text-align: center;'>Color: Red, Size: M</td>\n</tr>\n<tr>\n<td style='text-align: left;'>11</td>\n<td style='text-align: center;'>Clothing</td>\n<td style='text-align: center;'>T-Shirt</td>\n<td style='text-align: center;'>301</td>\n<td style='text-align: center;'>FashionCo</td>\n<td style='text-align: center;'>19.99</td>\n<td style='text-align: center;'>301-C</td>\n<td style='text-align: center;'>Color: Green, Size: L</td>\n</tr>\n<tr>\n<td style='text-align: left;'>12</td>\n<td style='text-align: center;'>Clothing</td>\n<td style='text-align: center;'>Jeans</td>\n<td style='text-align: center;'>302</td>\n<td style='text-align: center;'>DenimWorks</td>\n<td style='text-align: center;'>49.99</td>\n<td style='text-align: center;'>302-A</td>\n<td style='text-align: center;'>Color: Dark Blue, Size: 32</td>\n</tr>\n<tr>\n<td style='text-align: left;'>13</td>\n<td style='text-align: center;'>Clothing</td>\n<td style='text-align: 
center;'>Jeans</td>\n<td style='text-align: center;'>302</td>\n<td style='text-align: center;'>DenimWorks</td>\n<td style='text-align: center;'>49.99</td>\n<td style='text-align: center;'>302-B</td>\n<td style='text-align: center;'>Color: Light Blue, Size: 34</td>\n</tr>\n<tr>\n<td style='text-align: left;'>14</td>\n<td style='text-align: center;'>Books</td>\n<td style='text-align: center;'>Fiction Novel</td>\n<td style='text-align: center;'>401</td>\n<td style='text-align: center;'></td>\n<td style='text-align: center;'>14.99</td>\n<td style='text-align: center;'>401-A</td>\n<td style='text-align: center;'>Format: Hardcover, Language: English</td>\n</tr>\n<tr>\n<td style='text-align: left;'>15</td>\n<td style='text-align: center;'>Books</td>\n<td style='text-align: center;'>Fiction Novel</td>\n<td style='text-align: center;'>401</td>\n<td style='text-align: center;'></td>\n<td style='text-align: center;'>14.99</td>\n<td style='text-align: center;'>401-B</td>\n<td style='text-align: center;'>Format: Paperback, Language: Spanish</td>\n</tr>\n<tr>\n<td style='text-align: left;'>16</td>\n<td style='text-align: center;'>Books</td>\n<td style='text-align: center;'>Non-Fiction Guide</td>\n<td style='text-align: center;'>402</td>\n<td style='text-align: center;'></td>\n<td style='text-align: center;'>24.99</td>\n<td style='text-align: center;'>402-A</td>\n<td style='text-align: center;'>Format: eBook, Language: English</td>\n</tr>\n<tr>\n<td style='text-align: left;'>17</td>\n<td style='text-align: center;'>Books</td>\n<td style='text-align: center;'>Non-Fiction Guide</td>\n<td style='text-align: center;'>402</td>\n<td style='text-align: center;'></td>\n<td style='text-align: center;'>24.99</td>\n<td style='text-align: center;'>402-B</td>\n<td style='text-align: center;'>Format: Paperback, Language: French</td>\n</tr>\n<tr>\n<td style='text-align: left;'>18</td>\n<td style='text-align: center;'>Sports Equipment</td>\n<td style='text-align: center;'>Basketball</td>\n<td 
style='text-align: center;'>501</td>\n<td style='text-align: center;'>SportsGear</td>\n<td style='text-align: center;'>29.99</td>\n<td style='text-align: center;'>501-A</td>\n<td style='text-align: center;'>Size: Size 7, Color: Orange</td>\n</tr>\n<tr>\n<td style='text-align: left;'>19</td>\n<td style='text-align: center;'>Sports Equipment</td>\n<td style='text-align: center;'>Tennis Racket</td>\n<td style='text-align: center;'>502</td>\n<td style='text-align: center;'>RacketPro</td>\n<td style='text-align: center;'>89.99</td>\n<td style='text-align: center;'>502-A</td>\n<td style='text-align: center;'>Material: Graphite, Color: Black</td>\n</tr>\n<tr>\n<td style='border-bottom: 2px solid grey; text-align: left;'>20</td>\n<td style='border-bottom: 2px solid grey; text-align: center;'>Sports Equipment</td>\n<td style='border-bottom: 2px solid grey; text-align: center;'>Tennis Racket</td>\n<td style='border-bottom: 2px solid grey; text-align: center;'>502</td>\n<td style='border-bottom: 2px solid grey; text-align: center;'>RacketPro</td>\n<td style='border-bottom: 2px solid grey; text-align: center;'>89.99</td>\n<td style='border-bottom: 2px solid grey; text-align: center;'>502-B</td>\n<td style='border-bottom: 2px solid grey; text-align: center;'>Material: Aluminum, Color: Silver</td>\n</tr>\n</tbody>\n</table>"
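Since HTML mostly comes up when scraping, it may help to see the reverse direction. The sketch below parses a small hand-written table (a hypothetical stand-in for the htmlTable output above) back into a data frame with xml2. It assumes every row has the same number of cells.

```r
library(xml2)

# Hypothetical two-row table standing in for the htmlTable output.
html_text <- "<table>
  <thead><tr><th>Category</th><th>Brand</th></tr></thead>
  <tbody>
    <tr><td>Books</td><td></td></tr>
    <tr><td>Clothing</td><td>DenimWorks</td></tr>
  </tbody>
</table>"

doc <- read_html(html_text)

# Header cells give the column names; body cells are read row by row.
headers <- xml_text(xml_find_all(doc, ".//th"))
cells   <- xml_text(xml_find_all(doc, ".//tbody/tr/td"))
parsed  <- as.data.frame(matrix(cells, ncol = length(headers), byrow = TRUE))
names(parsed) <- headers

# Empty cells come back as "", so missing values are restored by hand.
parsed[parsed == ""] <- NA
print(parsed)
```

Every value comes back as a string and the missing Brand as an empty cell, which shows why HTML is a poor fit for data transmission.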
library(xml2)
xml_table <- xml_new_root("table_text")
# Add one <Row> element per data-frame row. XML element names cannot contain
# spaces, so spaces in the column names are replaced with underscores.
# invisible() keeps the list of created nodes from being printed.
invisible(apply(table_text, 1, function(row) {
  row_node <- xml_add_child(xml_table, "Row")
  lapply(names(row), function(col_name) {
    xml_add_child(row_node, gsub(" ", "_", col_name), row[[col_name]])
  })
}))
print(head(xml_table))
## $node
## <pointer: 0x7fe1be6fd680>
##
## $doc
## <pointer: 0x7fe1be675af0>
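Calling head() on an xml2 document only exposes its two internal pointers, which is why the output above shows no data. To look at the actual XML, the document can be serialized to text. This is a minimal sketch on a hypothetical one-row document; it uses an underscore in `Item_Name` because XML element names cannot contain spaces.

```r
library(xml2)

# Hypothetical one-row document built with the same xml2 calls.
doc <- xml_new_root("table_text")
row_node <- xml_add_child(doc, "Row")
xml_add_child(row_node, "Category", "Books")
xml_add_child(row_node, "Item_Name", "Fiction Novel")

# as.character() serializes the whole document to a string.
xml_string <- as.character(doc)
cat(xml_string)

# The text of every <Item_Name> element can be pulled back out with XPath.
item_names <- xml_text(xml_find_all(doc, ".//Item_Name"))
print(item_names)
```

To save the document rather than print it, xml2 also provides write_xml().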
library(arrow)
tf <- tempfile(fileext = ".parquet")
parquet_table <- write_parquet(table_text, tf)
print(parquet_table)
## # A tibble: 20 × 7
##    Category `Item Name` `Item ID` Brand Price `Variation ID` `Variation Details`
##    <chr>    <chr>       <chr>     <chr> <chr> <chr>          <chr>
##  1 Electro… Smartphone  101       Tech… 699.… 101-A          Color: Black, Stor…
##  2 Electro… Smartphone  101       Tech… 699.… 101-B          Color: White, Stor…
##  3 Electro… Laptop      102       Comp… 1099… 102-A          Color: Silver, Sto…
##  4 Electro… Laptop      102       Comp… 1099… 102-B          Color: Space Gray,…
##  5 Home Ap… Refrigerat… 201       Home… 899.… 201-A          Color: Stainless S…
##  6 Home Ap… Refrigerat… 201       Home… 899.… 201-B          Color: White, Capa…
##  7 Home Ap… Washing Ma… 202       Clea… 499.… 202-A          Type: Front Load, …
##  8 Home Ap… Washing Ma… 202       Clea… 499.… 202-B          Type: Top Load, Ca…
##  9 Clothing T-Shirt     301       Fash… 19.99 301-A          Color: Blue, Size:…
## 10 Clothing T-Shirt     301       Fash… 19.99 301-B          Color: Red, Size: M
## 11 Clothing T-Shirt     301       Fash… 19.99 301-C          Color: Green, Size…
## 12 Clothing Jeans       302       Deni… 49.99 302-A          Color: Dark Blue, …
## 13 Clothing Jeans       302       Deni… 49.99 302-B          Color: Light Blue,…
## 14 Books    Fiction No… 401       <NA>  14.99 401-A          Format: Hardcover,…
## 15 Books    Fiction No… 401       <NA>  14.99 401-B          Format: Paperback,…
## 16 Books    Non-Fictio… 402       <NA>  24.99 402-A          Format: eBook, Lan…
## 17 Books    Non-Fictio… 402       <NA>  24.99 402-B          Format: Paperback,…
## 18 Sports … Basketball  501       Spor… 29.99 501-A          Size: Size 7, Colo…
## 19 Sports … Tennis Rac… 502       Rack… 89.99 502-A          Material: Graphite…
## 20 Sports … Tennis Rac… 502       Rack… 89.99 502-B          Material: Aluminum…
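One caveat about the output above: write_parquet() returns its input invisibly, so printing `parquet_table` shows the original tibble, not what was stored in the file. Reading the file back is a more direct check. This is a minimal sketch on a made-up data frame (`toy` is hypothetical), using the same arrow package.

```r
library(arrow)

# Hypothetical two-row table with one missing Brand and a numeric Price.
toy <- data.frame(
  Category = c("Books", "Clothing"),
  Brand    = c(NA, "DenimWorks"),
  Price    = c(14.99, 49.99),
  stringsAsFactors = FALSE
)

tf <- tempfile(fileext = ".parquet")
write_parquet(toy, tf)

# Reading the file back is what actually exercises the Parquet encoding.
round_trip <- read_parquet(tf)
print(round_trip)
```

Unlike the text formats, Parquet stores column types and missing values in the file, so Price comes back numeric and the missing Brand comes back as NA with no extra parsing.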