Working with JSON, HTML, XML, and Parquet in R

Using the pdftools library, I imported the text directly from the assignment PDF, then used read_csv (from readr) to parse the comma-separated text into a data frame, skipping the lines above the table and dropping any unnecessary rows.

library(pdftools)
library(readr)  # provides read_csv()

raw_text <- pdf_text("File_Formats_Assignments.pdf")

table_text <- read_csv(file = raw_text, skip = 4, show_col_types = FALSE)

# Drop rows 24 to 27, which are not needed
table_text <- table_text[-(24:27), ]
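
The import step above depends on the assignment PDF, so here is a minimal, self-contained sketch of the same idea. The `page_text` string is a made-up stand-in for one page of `pdf_text()` output (two title lines followed by comma-separated rows); it is not the real file content, and the `skip` value is chosen for this toy input only.

```r
library(readr)

# Hypothetical stand-in for one page of pdf_text() output:
# two title lines, then a header row and two data rows.
page_text <- paste(
  "Inventory Report",
  "Generated for illustration only",
  "Category,Item Name,Price",
  "Books,Fiction Novel,14.99",
  "Clothing,T-Shirt,19.99",
  sep = "\n"
)

# I() marks the string as literal data rather than a file path.
# skip = 2 drops the two title lines so the header row is read first.
toy <- read_csv(I(page_text), skip = 2, show_col_types = FALSE)

print(toy)
```

Wrapping the text in `I()` makes the intent explicit; readr also treats a string containing a newline as literal data, which is presumably why passing `raw_text` directly works in the code above.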

Rows 6, 9 and 11 held data only in their first column: text that had been split off from the last column of the preceding row. I appended that text to the preceding row and then deleted the leftover rows.

Then I converted the dashes in the Brand column to NA values.

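# Append each overflow fragment (column 1 of the next row) to the last column of the row above it
table_text[[5, 7]] <- paste(table_text[[5, 7]], table_text[[6, 1]], sep = " ")
table_text[[8, 7]] <- paste(table_text[[8, 7]], table_text[[9, 1]], sep = " ")
table_text[[10, 7]] <- paste(table_text[[10, 7]], table_text[[11, 1]], sep = " ")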

table_text <- table_text[-c(6, 9, 11), ]

table_text$Brand[table_text$Brand == "-"] <- NA
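
The row repair above uses hard-coded row numbers, which only work for this one PDF. As a rough sketch of how the same step could be generalized, the hypothetical helper below (`merge_continuation_rows` is my own name, not part of any package) assumes that a continuation row has text in the first column and NA in every other column, and folds that text into the last column of the nearest real row above it. The toy data frame is invented for illustration.

```r
# Hypothetical helper: fold "continuation" rows into the row above them.
# Assumption: a continuation row has text in column 1 and NA everywhere else.
merge_continuation_rows <- function(df) {
  last_col <- ncol(df)
  is_cont <- !is.na(df[[1]]) & rowSums(!is.na(df[, -1, drop = FALSE])) == 0

  for (i in which(is_cont)) {
    # Walk back to the nearest row that is not itself a continuation row.
    target <- i - 1
    while (target >= 1 && is_cont[target]) target <- target - 1
    if (target < 1) next  # nothing above to attach to; the fragment is dropped
    df[[last_col]][target] <- paste(df[[last_col]][target], df[[1]][i])
  }

  # Keep only the real rows.
  df[!is_cont, , drop = FALSE]
}

# Toy input: the second row is an overflow fragment of the first row's Details.
toy <- data.frame(
  Item = c("T-Shirt", "Size: S", "Jeans"),
  ID = c("301", NA, "302"),
  Details = c("Color: Blue,", NA, "Color: Dark Blue, Size: 32"),
  stringsAsFactors = FALSE
)

clean <- merge_continuation_rows(toy)
print(clean)
```

Whether this detection rule fits a given PDF depends on how the text was extracted, so it is worth checking the flagged rows by eye before deleting anything.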

Below, I used the following libraries to convert the data frame into each format: rjson, htmlTable, xml2 and arrow.

JSON: As a front-end software engineer, I have had the most experience with this format. It’s fairly easy to read and very popular, with plenty of tools available for parsing and conversion. Most web APIs return data in this format, so it can be used across many languages and libraries. But because JSON is plain text, it is verbose for large datasets, and it has no built-in schema and only a handful of data types.

HTML: This is the underlying structure of nearly every web page. It’s meant for displaying information rather than transmitting data, but it’s unavoidable when web scraping.

XML: Like HTML, XML uses tags, which makes it verbose and means values often have to be parsed out of the markup. But since the tags can be customized, the data can carry more structure and definition.

Parquet: Parquet is a columnar binary format designed for efficiency and scalability with very large datasets, and it can be used across platforms. However, once converted, the data can no longer be read directly by humans; it has to be loaded with a Parquet-aware tool.
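
To make the point about tag overhead concrete, here is a small base-R sketch that writes the same record as JSON and as XML and compares the character counts. The record is invented, and the two serializers are deliberately naive (text values only, no escaping); they are for illustration and are not a substitute for rjson or xml2.

```r
# One record from a hypothetical inventory table.
record <- c(Category = "Books", Item_ID = "401", Price = "14.99")

# Naive serializers for illustration only: no escaping, text values only.
json_str <- paste0(
  "{",
  paste0('"', names(record), '":"', record, '"', collapse = ","),
  "}"
)
xml_str <- paste0(
  "<Row>",
  paste0("<", names(record), ">", record, "</", names(record), ">", collapse = ""),
  "</Row>"
)

cat(json_str, "\n", xml_str, "\n", sep = "")

# Character counts: every XML field name appears twice (opening and closing tag).
print(c(json = nchar(json_str), xml = nchar(xml_str)))
```

Both formats repeat the field names for every record, which is the main reason a columnar binary format such as Parquet tends to be more compact for large tables.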


JSON

library(rjson)
## Warning: package 'rjson' was built under R version 4.4.1
# toJSON() serializes the data frame column by column into a single JSON string
json_table <- toJSON(table_text)
print(json_table)
## [1] "{\"Category\":[\"Electronics\",\"Electronics\",\"Electronics\",\"Electronics\",\"Home Appliances\",\"Home Appliances\",\"Home Appliances\",\"Home Appliances\",\"Clothing\",\"Clothing\",\"Clothing\",\"Clothing\",\"Clothing\",\"Books\",\"Books\",\"Books\",\"Books\",\"Sports Equipment\",\"Sports Equipment\",\"Sports Equipment\"],\"Item Name\":[\"Smartphone\",\"Smartphone\",\"Laptop\",\"Laptop\",\"Refrigerator\",\"Refrigerator\",\"Washing Machine\",\"Washing Machine\",\"T-Shirt\",\"T-Shirt\",\"T-Shirt\",\"Jeans\",\"Jeans\",\"Fiction Novel\",\"Fiction Novel\",\"Non-Fiction Guide\",\"Non-Fiction Guide\",\"Basketball\",\"Tennis Racket\",\"Tennis Racket\"],\"Item ID\":[\"101\",\"101\",\"102\",\"102\",\"201\",\"201\",\"202\",\"202\",\"301\",\"301\",\"301\",\"302\",\"302\",\"401\",\"401\",\"402\",\"402\",\"501\",\"502\",\"502\"],\"Brand\":[\"TechBrand\",\"TechBrand\",\"CompuBrand\",\"CompuBrand\",\"HomeCool\",\"HomeCool\",\"CleanTech\",\"CleanTech\",\"FashionCo\",\"FashionCo\",\"FashionCo\",\"DenimWorks\",\"DenimWorks\",\"NA\",\"NA\",\"NA\",\"NA\",\"SportsGear\",\"RacketPro\",\"RacketPro\"],\"Price\":[\"699.99\",\"699.99\",\"1099.99\",\"1099.99\",\"899.99\",\"899.99\",\"499.99\",\"499.99\",\"19.99\",\"19.99\",\"19.99\",\"49.99\",\"49.99\",\"14.99\",\"14.99\",\"24.99\",\"24.99\",\"29.99\",\"89.99\",\"89.99\"],\"Variation ID\":[\"101-A\",\"101-B\",\"102-A\",\"102-B\",\"201-A\",\"201-B\",\"202-A\",\"202-B\",\"301-A\",\"301-B\",\"301-C\",\"302-A\",\"302-B\",\"401-A\",\"401-B\",\"402-A\",\"402-B\",\"501-A\",\"502-A\",\"502-B\"],\"Variation Details\":[\"Color: Black, Storage: 64GB\",\"Color: White, Storage: 128GB\",\"Color: Silver, Storage: 256GB\",\"Color: Space Gray, Storage: 512GB\",\"Color: Stainless Steel, Capacity: 20 cu ft\",\"Color: White, Capacity: 18 cu ft\",\"Type: Front Load, Capacity: 4.5 cu ft\",\"Type: Top Load, Capacity: 5.0 cu ft\",\"Color: Blue, Size: S\",\"Color: Red, Size: M\",\"Color: Green, Size: L\",\"Color: Dark Blue, Size: 32\",\"Color: Light Blue, 
Size: 34\",\"Format: Hardcover, Language: English\",\"Format: Paperback, Language: Spanish\",\"Format: eBook, Language: English\",\"Format: Paperback, Language: French\",\"Size: Size 7, Color: Orange\",\"Material: Graphite, Color: Black\",\"Material: Aluminum, Color: Silver\"]}"
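
The output above is column-oriented: one JSON object whose values are arrays, with the missing Brand values written as the string "NA". As a hedged sketch of two related techniques, the example below uses a small invented data frame to show a row-oriented layout (one object per record, which is closer to what many APIs return) and a round trip back into a data frame with `fromJSON()`.

```r
library(rjson)

# Toy data frame (invented values, not the assignment data).
toy <- data.frame(
  Category = c("Books", "Clothing"),
  Price = c(14.99, 19.99),
  stringsAsFactors = FALSE
)

# Column-oriented: one object whose values are arrays, as in the output above.
by_column <- toJSON(toy)

# Row-oriented: a JSON array with one object per record.
records <- lapply(seq_len(nrow(toy)), function(i) as.list(toy[i, ]))
by_row <- toJSON(records)

cat(by_column, "\n")
cat(by_row, "\n")

# Round trip: the column-oriented string parses to a named list of vectors,
# which converts directly back into a data frame.
round_trip <- as.data.frame(fromJSON(by_column), stringsAsFactors = FALSE)
parsed_rows <- fromJSON(by_row)
```

Keeping Price numeric in the toy data means it is written as a JSON number rather than a quoted string; in the assignment data every column is character, so every value is quoted.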

HTML

library(htmlTable)

html_table <- htmlTable(table_text)
print(head(html_table))
## [1] "<table class='gmisc_table' style='border-collapse: collapse; margin-top: 1em; margin-bottom: 1em;' >\n<thead>\n<tr><th style='border-bottom: 1px solid grey; border-top: 2px solid grey;'></th>\n<th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Category</th>\n<th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Item Name</th>\n<th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Item ID</th>\n<th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Brand</th>\n<th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Price</th>\n<th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Variation ID</th>\n<th style='font-weight: 900; border-bottom: 1px solid grey; border-top: 2px solid grey; text-align: center;'>Variation Details</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td style='text-align: left;'>1</td>\n<td style='text-align: center;'>Electronics</td>\n<td style='text-align: center;'>Smartphone</td>\n<td style='text-align: center;'>101</td>\n<td style='text-align: center;'>TechBrand</td>\n<td style='text-align: center;'>699.99</td>\n<td style='text-align: center;'>101-A</td>\n<td style='text-align: center;'>Color: Black, Storage: 64GB</td>\n</tr>\n<tr>\n<td style='text-align: left;'>2</td>\n<td style='text-align: center;'>Electronics</td>\n<td style='text-align: center;'>Smartphone</td>\n<td style='text-align: center;'>101</td>\n<td style='text-align: center;'>TechBrand</td>\n<td style='text-align: center;'>699.99</td>\n<td style='text-align: center;'>101-B</td>\n<td style='text-align: center;'>Color: White, Storage: 128GB</td>\n</tr>\n<tr>\n<td style='text-align: left;'>3</td>\n<td style='text-align: center;'>Electronics</td>\n<td 
style='text-align: center;'>Laptop</td>\n<td style='text-align: center;'>102</td>\n<td style='text-align: center;'>CompuBrand</td>\n<td style='text-align: center;'>1099.99</td>\n<td style='text-align: center;'>102-A</td>\n<td style='text-align: center;'>Color: Silver, Storage: 256GB</td>\n</tr>\n<tr>\n<td style='text-align: left;'>4</td>\n<td style='text-align: center;'>Electronics</td>\n<td style='text-align: center;'>Laptop</td>\n<td style='text-align: center;'>102</td>\n<td style='text-align: center;'>CompuBrand</td>\n<td style='text-align: center;'>1099.99</td>\n<td style='text-align: center;'>102-B</td>\n<td style='text-align: center;'>Color: Space Gray, Storage: 512GB</td>\n</tr>\n<tr>\n<td style='text-align: left;'>5</td>\n<td style='text-align: center;'>Home Appliances</td>\n<td style='text-align: center;'>Refrigerator</td>\n<td style='text-align: center;'>201</td>\n<td style='text-align: center;'>HomeCool</td>\n<td style='text-align: center;'>899.99</td>\n<td style='text-align: center;'>201-A</td>\n<td style='text-align: center;'>Color: Stainless Steel, Capacity: 20 cu ft</td>\n</tr>\n<tr>\n<td style='text-align: left;'>6</td>\n<td style='text-align: center;'>Home Appliances</td>\n<td style='text-align: center;'>Refrigerator</td>\n<td style='text-align: center;'>201</td>\n<td style='text-align: center;'>HomeCool</td>\n<td style='text-align: center;'>899.99</td>\n<td style='text-align: center;'>201-B</td>\n<td style='text-align: center;'>Color: White, Capacity: 18 cu ft</td>\n</tr>\n<tr>\n<td style='text-align: left;'>7</td>\n<td style='text-align: center;'>Home Appliances</td>\n<td style='text-align: center;'>Washing Machine</td>\n<td style='text-align: center;'>202</td>\n<td style='text-align: center;'>CleanTech</td>\n<td style='text-align: center;'>499.99</td>\n<td style='text-align: center;'>202-A</td>\n<td style='text-align: center;'>Type: Front Load, Capacity: 4.5 cu ft</td>\n</tr>\n<tr>\n<td style='text-align: left;'>8</td>\n<td style='text-align: 
center;'>Home Appliances</td>\n<td style='text-align: center;'>Washing Machine</td>\n<td style='text-align: center;'>202</td>\n<td style='text-align: center;'>CleanTech</td>\n<td style='text-align: center;'>499.99</td>\n<td style='text-align: center;'>202-B</td>\n<td style='text-align: center;'>Type: Top Load, Capacity: 5.0 cu ft</td>\n</tr>\n<tr>\n<td style='text-align: left;'>9</td>\n<td style='text-align: center;'>Clothing</td>\n<td style='text-align: center;'>T-Shirt</td>\n<td style='text-align: center;'>301</td>\n<td style='text-align: center;'>FashionCo</td>\n<td style='text-align: center;'>19.99</td>\n<td style='text-align: center;'>301-A</td>\n<td style='text-align: center;'>Color: Blue, Size: S</td>\n</tr>\n<tr>\n<td style='text-align: left;'>10</td>\n<td style='text-align: center;'>Clothing</td>\n<td style='text-align: center;'>T-Shirt</td>\n<td style='text-align: center;'>301</td>\n<td style='text-align: center;'>FashionCo</td>\n<td style='text-align: center;'>19.99</td>\n<td style='text-align: center;'>301-B</td>\n<td style='text-align: center;'>Color: Red, Size: M</td>\n</tr>\n<tr>\n<td style='text-align: left;'>11</td>\n<td style='text-align: center;'>Clothing</td>\n<td style='text-align: center;'>T-Shirt</td>\n<td style='text-align: center;'>301</td>\n<td style='text-align: center;'>FashionCo</td>\n<td style='text-align: center;'>19.99</td>\n<td style='text-align: center;'>301-C</td>\n<td style='text-align: center;'>Color: Green, Size: L</td>\n</tr>\n<tr>\n<td style='text-align: left;'>12</td>\n<td style='text-align: center;'>Clothing</td>\n<td style='text-align: center;'>Jeans</td>\n<td style='text-align: center;'>302</td>\n<td style='text-align: center;'>DenimWorks</td>\n<td style='text-align: center;'>49.99</td>\n<td style='text-align: center;'>302-A</td>\n<td style='text-align: center;'>Color: Dark Blue, Size: 32</td>\n</tr>\n<tr>\n<td style='text-align: left;'>13</td>\n<td style='text-align: center;'>Clothing</td>\n<td style='text-align: 
center;'>Jeans</td>\n<td style='text-align: center;'>302</td>\n<td style='text-align: center;'>DenimWorks</td>\n<td style='text-align: center;'>49.99</td>\n<td style='text-align: center;'>302-B</td>\n<td style='text-align: center;'>Color: Light Blue, Size: 34</td>\n</tr>\n<tr>\n<td style='text-align: left;'>14</td>\n<td style='text-align: center;'>Books</td>\n<td style='text-align: center;'>Fiction Novel</td>\n<td style='text-align: center;'>401</td>\n<td style='text-align: center;'></td>\n<td style='text-align: center;'>14.99</td>\n<td style='text-align: center;'>401-A</td>\n<td style='text-align: center;'>Format: Hardcover, Language: English</td>\n</tr>\n<tr>\n<td style='text-align: left;'>15</td>\n<td style='text-align: center;'>Books</td>\n<td style='text-align: center;'>Fiction Novel</td>\n<td style='text-align: center;'>401</td>\n<td style='text-align: center;'></td>\n<td style='text-align: center;'>14.99</td>\n<td style='text-align: center;'>401-B</td>\n<td style='text-align: center;'>Format: Paperback, Language: Spanish</td>\n</tr>\n<tr>\n<td style='text-align: left;'>16</td>\n<td style='text-align: center;'>Books</td>\n<td style='text-align: center;'>Non-Fiction Guide</td>\n<td style='text-align: center;'>402</td>\n<td style='text-align: center;'></td>\n<td style='text-align: center;'>24.99</td>\n<td style='text-align: center;'>402-A</td>\n<td style='text-align: center;'>Format: eBook, Language: English</td>\n</tr>\n<tr>\n<td style='text-align: left;'>17</td>\n<td style='text-align: center;'>Books</td>\n<td style='text-align: center;'>Non-Fiction Guide</td>\n<td style='text-align: center;'>402</td>\n<td style='text-align: center;'></td>\n<td style='text-align: center;'>24.99</td>\n<td style='text-align: center;'>402-B</td>\n<td style='text-align: center;'>Format: Paperback, Language: French</td>\n</tr>\n<tr>\n<td style='text-align: left;'>18</td>\n<td style='text-align: center;'>Sports Equipment</td>\n<td style='text-align: center;'>Basketball</td>\n<td 
style='text-align: center;'>501</td>\n<td style='text-align: center;'>SportsGear</td>\n<td style='text-align: center;'>29.99</td>\n<td style='text-align: center;'>501-A</td>\n<td style='text-align: center;'>Size: Size 7, Color: Orange</td>\n</tr>\n<tr>\n<td style='text-align: left;'>19</td>\n<td style='text-align: center;'>Sports Equipment</td>\n<td style='text-align: center;'>Tennis Racket</td>\n<td style='text-align: center;'>502</td>\n<td style='text-align: center;'>RacketPro</td>\n<td style='text-align: center;'>89.99</td>\n<td style='text-align: center;'>502-A</td>\n<td style='text-align: center;'>Material: Graphite, Color: Black</td>\n</tr>\n<tr>\n<td style='border-bottom: 2px solid grey; text-align: left;'>20</td>\n<td style='border-bottom: 2px solid grey; text-align: center;'>Sports Equipment</td>\n<td style='border-bottom: 2px solid grey; text-align: center;'>Tennis Racket</td>\n<td style='border-bottom: 2px solid grey; text-align: center;'>502</td>\n<td style='border-bottom: 2px solid grey; text-align: center;'>RacketPro</td>\n<td style='border-bottom: 2px solid grey; text-align: center;'>89.99</td>\n<td style='border-bottom: 2px solid grey; text-align: center;'>502-B</td>\n<td style='border-bottom: 2px solid grey; text-align: center;'>Material: Aluminum, Color: Silver</td>\n</tr>\n</tbody>\n</table>"
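
Since HTML mostly comes up when scraping, it may help to see the reverse direction as well. The sketch below parses a small hand-written table (not the htmlTable output above) with xml2 and rebuilds a data frame from its cells; the XPath expressions assume a simple table with a `thead` and a `tbody` and no merged cells.

```r
library(xml2)

# A small hand-written HTML table used only for illustration.
html_str <- "<table>
  <thead><tr><th>Category</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>Books</td><td>14.99</td></tr>
    <tr><td>Clothing</td><td>19.99</td></tr>
  </tbody>
</table>"

doc <- read_html(html_str)

# Column names come from the header cells.
headers <- xml_text(xml_find_all(doc, ".//th"))

# Each body row becomes a character vector of its cell texts.
rows <- xml_find_all(doc, ".//tbody/tr")
cells <- lapply(rows, function(tr) xml_text(xml_find_all(tr, "./td")))

# Stack the rows into a data frame; every value is still text at this point.
scraped <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
names(scraped) <- headers

print(scraped)
```

Everything comes back as character data, so numeric columns such as Price would still need an explicit conversion.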

XML

library(xml2)

xml_table <- xml_new_root("table_text")

# XML element names cannot contain spaces, so replace them with underscores.
xml_names <- gsub(" ", "_", names(table_text), fixed = TRUE)

# One <Row> per record, one child element per column; for loops avoid auto-printing the nodes.
for (i in seq_len(nrow(table_text))) {
  row_node <- xml_add_child(xml_table, "Row")
  for (j in seq_along(xml_names)) {
    value <- table_text[[j]][i]
    # Missing values become empty elements instead of the text "NA".
    xml_add_child(row_node, xml_names[j], if (is.na(value)) "" else value)
  }
}
print(head(xml_table))
## $node
## <pointer: 0x7fe1be6fd680>
## 
## $doc
## <pointer: 0x7fe1be675af0>
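
Printing `head()` of an xml2 document only shows its internal pointers, as seen above. A minimal sketch of how to look at the actual markup, assuming a tiny invented document rather than the full table: serialize it with `as.character()`, then parse the string again and query it with XPath to confirm the values survived.

```r
library(xml2)

# Build a tiny document (invented values; element names avoid spaces).
doc <- xml_new_root("inventory")
row_node <- xml_add_child(doc, "Row")
xml_add_child(row_node, "Category", "Books")
xml_add_child(row_node, "Item_Name", "Fiction Novel")

# Serialize the whole document to a string to see the markup itself.
xml_str <- as.character(doc)
cat(xml_str)

# Parse the string again and pull values back out with XPath.
parsed <- read_xml(xml_str)
item_names <- xml_text(xml_find_all(parsed, ".//Item_Name"))
print(item_names)
```

`write_xml(doc, "some_file.xml")` would save the same markup to disk; the file name here is only a placeholder.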

Parquet

library(arrow)
## Warning: package 'arrow' was built under R version 4.4.1
## 
## Attaching package: 'arrow'
## The following object is masked from 'package:lubridate':
## 
##     duration
## The following object is masked from 'package:utils':
## 
##     timestamp
tf <- tempfile(fileext = ".parquet")
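# write_parquet() returns its input invisibly, so read the file back to confirm the round trip
write_parquet(table_text, tf)
parquet_table <- read_parquet(tf)

print(parquet_table)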
## # A tibble: 20 × 7
##    Category `Item Name` `Item ID` Brand Price `Variation ID` `Variation Details`
##    <chr>    <chr>       <chr>     <chr> <chr> <chr>          <chr>              
##  1 Electro… Smartphone  101       Tech… 699.… 101-A          Color: Black, Stor…
##  2 Electro… Smartphone  101       Tech… 699.… 101-B          Color: White, Stor…
##  3 Electro… Laptop      102       Comp… 1099… 102-A          Color: Silver, Sto…
##  4 Electro… Laptop      102       Comp… 1099… 102-B          Color: Space Gray,…
##  5 Home Ap… Refrigerat… 201       Home… 899.… 201-A          Color: Stainless S…
##  6 Home Ap… Refrigerat… 201       Home… 899.… 201-B          Color: White, Capa…
##  7 Home Ap… Washing Ma… 202       Clea… 499.… 202-A          Type: Front Load, …
##  8 Home Ap… Washing Ma… 202       Clea… 499.… 202-B          Type: Top Load, Ca…
##  9 Clothing T-Shirt     301       Fash… 19.99 301-A          Color: Blue, Size:…
## 10 Clothing T-Shirt     301       Fash… 19.99 301-B          Color: Red, Size: M
## 11 Clothing T-Shirt     301       Fash… 19.99 301-C          Color: Green, Size…
## 12 Clothing Jeans       302       Deni… 49.99 302-A          Color: Dark Blue, …
## 13 Clothing Jeans       302       Deni… 49.99 302-B          Color: Light Blue,…
## 14 Books    Fiction No… 401       <NA>  14.99 401-A          Format: Hardcover,…
## 15 Books    Fiction No… 401       <NA>  14.99 401-B          Format: Paperback,…
## 16 Books    Non-Fictio… 402       <NA>  24.99 402-A          Format: eBook, Lan…
## 17 Books    Non-Fictio… 402       <NA>  24.99 402-B          Format: Paperback,…
## 18 Sports … Basketball  501       Spor… 29.99 501-A          Size: Size 7, Colo…
## 19 Sports … Tennis Rac… 502       Rack… 89.99 502-A          Material: Graphite…
## 20 Sports … Tennis Rac… 502       Rack… 89.99 502-B          Material: Aluminum…
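
Because Parquet stores data by column, a reader can load only the columns it needs. The sketch below illustrates this with a small invented data frame: it writes a temporary Parquet file, reads it back in full, and then reads just two columns using the `col_select` argument of `read_parquet()`. It also shows that column types and missing values survive the round trip.

```r
library(arrow)

# Toy data frame (invented values) with a numeric column and a missing value.
toy <- data.frame(
  Category = c("Books", "Clothing"),
  Brand = c(NA, "FashionCo"),
  Price = c(14.99, 19.99),
  stringsAsFactors = FALSE
)

tf <- tempfile(fileext = ".parquet")
write_parquet(toy, tf)

# Read everything back: types and NA values are preserved.
full <- read_parquet(tf)
str(full)

# Read only the columns that are needed.
prices_only <- read_parquet(tf, col_select = c("Category", "Price"))
print(prices_only)
```

On a file this small the difference is negligible; the benefit of selecting columns shows up with wide tables and large row counts.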