library(tidyverse)
library(arrow)
library(dbplyr, warn.conflicts = FALSE)
library(duckdb)Module 8 Reflection: Arrow, Parquet, and DuckDB
1 Introduction
This reflection covers the three videos assigned in Module 8 of IBM 6530, all focused on how the R ecosystem has evolved to handle datasets that are larger than a single computer’s memory. The same set of tools keeps appearing across all three videos: Apache Arrow, the Parquet file format, and the DuckDB query engine. Each speaker introduces them from a different angle.
2 Step 1: Scaling Up Data Workflows with Arrow, Parquet, and DuckDB
Presenter: Neal Richardson, co-creator of the arrow R package and co-author of Scaling Up With R and Arrow
The biggest takeaway from Neal Richardson’s talk is that working with large data in R is no longer about buying more RAM or learning a completely new tool. It’s about pairing R with the right file format and the right query engine so that R doesn’t have to load everything into memory at once. He frames the whole talk around three building blocks: Parquet files, partitioning, and query engines.
The first idea that stuck with me is that Parquet is to disk what Arrow is to memory. Both are open, community-supported standards, but they’re optimized for different jobs. Parquet stores data column by column instead of row by row, which is a meaningful shift coming from CSVs. Because values in the same column are stored next to each other, Parquet files compress much more efficiently and let a query engine read only the columns it actually needs. Richardson’s census example made this concrete: his dataset has 311 columns, but a typical analysis might only touch two of them. Reading the other 309 just to throw them away is exactly the kind of waste that Parquet avoids. Parquet also stores data types alongside the data itself, which means there’s no guessing on read. That’s a problem I’ve already run into when reading messier CSVs into R.
The second idea is partitioning, which is essentially the file-system version of an index. By splitting a large dataset into many smaller files organized by a meaningful variable (like year or location), a query engine can skip entire files it doesn’t need. Hive-style partitioning (year=2005/) was new to me, but it’s an elegant convention: the partition variable and value are encoded right in the folder name, so the engine can filter without ever opening the file. Richardson cautioned against partitioning by high-cardinality variables, since too many tiny files create their own overhead.
The third idea connected most directly to what we’ve already done in this course. Richardson described a query engine as a database without the storage. It’s the part that does the work of optimizing and executing queries, separated from where the data actually lives. That reframed a lot of what I learned about BigQuery for me. In BigQuery, the engine and the storage are bundled into one managed service. With Arrow or DuckDB, the storage is just files on my own disk, and I can start an R session that becomes the query engine on top of them. The fact that Arrow and DuckDB share the same in-memory data structure means you can hand data back and forth between them without copying or translating anything, which is a real advantage when one tool happens to support a function the other doesn’t.
The overall message is that R users now have access to genuinely state-of-the-art analytic database technology just by installing a few packages from CRAN. No IT department, no separate database server, no new syntax to learn beyond dplyr.
3 Step 2: Introduction to Arrow (R-Ladies Ottawa Workshop)
Presenters: Nic Crane and Steph Hazlitt, R-Ladies Ottawa (English session)
This section is organized as a codebook following the structure of the workshop. The workshop uses the same Seattle library checkouts dataset that the R4DS Ch. 22 chapter uses and that Dr. Jung used in lecture. Code chunks that depend on the full 9 GB Seattle download are set to eval: false so the document renders quickly; expected output is shown beneath each chunk as comments (#>).
3.1 Setup and packages
The workshop uses four core packages. tidyverse is loaded for dplyr syntax, arrow provides the connection to Apache Arrow and Parquet, and dbplyr + duckdb are needed later when we bridge between Arrow and DuckDB.
Note from the workshop: Crane emphasized that
arrowis loaded in addition to the tidyverse, not as a replacement. The whole point is that the dplyr syntax you already know carries over.
3.2 What is Apache Arrow?
The workshop spent the first minutes on conceptual background before any code. The key distinctions to remember:
- Apache Arrow = the cross-language project and in-memory columnar specification.
arrow(lowercase) = the R package that wraps the Arrow C++ library.- Parquet = the on-disk file format that pairs naturally with Arrow.
- DuckDB = a separate, embeddable SQL query engine that shares Arrow’s memory format, so data can pass between them without copying.
Arrow stores data column-by-column rather than row-by-row, which is what makes column-subset queries fast.
3.3 Getting the data
The workshop downloads the Seattle library dataset (~9 GB CSV) using curl::multi_download(), which is preferred over download.file() because it supports resuming interrupted downloads and shows a progress bar.
dir.create("data", showWarnings = FALSE)
curl::multi_download(
"https://r4ds.s3.us-west-2.amazonaws.com/seattle-library-checkouts.csv",
"data/seattle-library-checkouts.csv",
resume = TRUE
)Note: If the download fails partway through, just re-run the same command. The
resume = TRUEargument picks up where it left off.
3.4 Opening a dataset
This is the central concept of the workshop. Instead of read_csv() (which loads everything into RAM), we use open_dataset(), which only scans enough of the file to figure out the schema, then stops.
seattle_csv <- open_dataset(
sources = "data/seattle-library-checkouts.csv",
col_types = schema(ISBN = string()),
format = "csv"
)
seattle_csv
#> FileSystemDataset with 1 csv file
#> UsageClass: string
#> CheckoutType: string
#> MaterialType: string
#> CheckoutYear: int64
#> CheckoutMonth: int64
#> Checkouts: int64
#> Title: string
#> ISBN: string
#> Creator: string
#> Subjects: string
#> Publisher: string
#> PublicationYear: stringHelpful note: The
col_types = schema(ISBN = string())argument is a workaround. The ISBN column is blank for the first ~80,000 rows, so without telling Arrow that it’s a string, it would guess the wrong type. This is a common gotcha. Be explicit about types when you know a column has many blanks at the start.
glimpse() works on Arrow datasets and confirms the size:
seattle_csv |> glimpse()
#> FileSystemDataset with 1 csv file
#> 41,389,465 rows x 12 columns3.5 Using dplyr verbs on Arrow datasets: lazy vs. eager
This is where the workshop’s “aha moment” lives. You write dplyr code as usual, but nothing actually runs until you call collect(). Until then, you’re just building up a query plan.
# Build the query. This returns instantly because nothing has executed
query <- seattle_csv |>
group_by(CheckoutYear) |>
summarise(Checkouts = sum(Checkouts)) |>
arrange(CheckoutYear)
query
#> FileSystemDataset (query)
#> CheckoutYear: int64
#> Checkouts: int64
#> ...
# Execute it and pull the results into R
query |> collect()
#> # A tibble: 18 × 2
#> CheckoutYear Checkouts
#> <int> <int>
#> 1 2005 3798685
#> 2 2006 6599318
#> ...Helpful note: Think of
collect()as the “go” button. The workshop framed it as: build your pipeline lazily, then onlycollect()when the data is small enough to fit in memory comfortably.
3.5.1 collect() vs compute() vs head()
The workshop spent time on these three closely related functions:
collect(): execute the query and return the result as an R tibble in memory.compute(): execute the query but keep the result as an Arrow Table (still out of RAM). Useful for intermediate steps in a long pipeline.head(n): execute and return just the firstnrows. Useful for previewing without pulling the whole result.
# Preview without pulling everything
seattle_csv |>
filter(CheckoutYear == 2021) |>
head(10) |>
collect()3.6 The Parquet format
The workshop’s second major theme is that CSV is a terrible format for large analytical data. Parquet’s advantages, as covered in the talk:
- Smaller files: efficient encoding plus compression.
- Self-describing types: no guessing what
"08-10-2022"means. - Columnar layout: reads only the columns the query needs.
- Chunked: different parts can be processed in parallel.
The tradeoff: Parquet files are binary, so you can’t open them in a text editor.
3.6.1 Writing to Parquet with partitioning
The workshop demonstrated converting the single 9 GB CSV into 18 partitioned Parquet files (one per year). The pattern is group_by() followed by write_dataset():
pq_path <- "data/seattle-library-checkouts"
seattle_csv |>
group_by(CheckoutYear) |>
write_dataset(path = pq_path, format = "parquet")This creates a folder structure using Hive-style partitioning:
data/seattle-library-checkouts/
├── CheckoutYear=2005/part-0.parquet (~109 MB)
├── CheckoutYear=2006/part-0.parquet (~164 MB)
├── CheckoutYear=2007/part-0.parquet (~178 MB)
└── ... (18 files total, ~4 GB combined)
Rules of thumb from the workshop:
- Aim for files between 20 MB and 2 GB each.
- Avoid creating more than ~10,000 partition files total.
- Partition by variables you’ll filter on. That’s how Arrow gets to skip files entirely.
3.7 Reading the partitioned dataset back
seattle_pq <- open_dataset(pq_path)Now the dataset is opened against the directory of Parquet files, not a single CSV. Queries automatically take advantage of the partitioning.
3.8 Performance comparison
The workshop ran the same query against both the CSV and the Parquet versions to show the difference:
# CSV version
seattle_csv |>
filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
group_by(CheckoutMonth) |>
summarize(TotalCheckouts = sum(Checkouts)) |>
collect() |>
system.time()
#> user system elapsed
#> 11.951 1.297 11.387
# Parquet version
seattle_pq |>
filter(CheckoutYear == 2021, MaterialType == "BOOK") |>
group_by(CheckoutMonth) |>
summarize(TotalCheckouts = sum(Checkouts)) |>
collect() |>
system.time()
#> user system elapsed
#> 0.263 0.058 0.063Why ~100x faster? Two reasons:
- Partitioning: Arrow reads only the
CheckoutYear=2021folder and ignores the other 17.- Columnar format: only the 4 columns used in the query (CheckoutYear, MaterialType, CheckoutMonth, Checkouts) are read from disk. The other 8 columns are skipped entirely.
3.9 Bridging Arrow and DuckDB
The final section of the workshop showed to_duckdb(), which converts an Arrow dataset into a DuckDB connection without copying the data. This matters when you want SQL or a function that DuckDB supports but Arrow’s query engine (Acero) doesn’t yet.
seattle_pq |>
to_duckdb() |>
filter(CheckoutYear >= 2018, MaterialType == "BOOK") |>
group_by(CheckoutYear) |>
summarize(TotalCheckouts = sum(Checkouts)) |>
arrange(desc(CheckoutYear)) |>
collect()Key takeaway: Because Arrow and DuckDB share the same in-memory layout,
to_duckdb()is essentially free. There’s no serialization cost, and you can freely move between the two engines mid-pipeline.
3.10 Summary of the workshop
| Concept | Function / Tool | When to use it |
|---|---|---|
| Open a large file lazily | arrow::open_dataset() |
Any file/folder too big for RAM |
| Inspect structure | glimpse() |
Confirm row count and column types |
| Build query lazily | dplyr verbs (filter, group_by, etc.) |
Always; nothing runs until you collect() |
| Execute and pull to R | collect() |
When the result is small enough for memory |
| Execute but keep in Arrow | compute() |
Intermediate steps in a long pipeline |
| Convert CSV → Parquet | write_dataset(format = "parquet") |
One-time investment for any large CSV you’ll reuse |
| Partition data | group_by() before write_dataset() |
When you’ll repeatedly filter on a variable |
| Switch to DuckDB mid-pipeline | to_duckdb() |
When you need a SQL feature Arrow doesn’t support |
4 Step 3: Using arrow and duckdb to Wrangle Larger-than-RAM Medical Datasets
Presenter: Dr. Peter Higgins, R/Medicine Conference 2022
4.1 The problem Dr. Higgins is solving
Dr. Higgins opens with a concrete, motivating constraint: he wants to analyze the CMS Open Payments dataset, which records every payment from medical companies to physicians in the United States. The full dataset is roughly 94 GB across nine years of CSV files. His laptop has 16 GB of RAM. Standard R workflows that read entire datasets into memory simply will not work, since read_csv() would crash before it ever finished loading a single year’s file.
The talk walks through a workflow that turns this from an impossible problem into a routine one, using two packages: arrow for handling the larger-than-RAM data on disk, and duckdb for the moments where Arrow alone isn’t enough.
4.2 Why not just use data.table?
Before introducing Arrow, Dr. Higgins gives credit to data.table and its fread() function, which he notes is roughly 30× faster than base R for reading CSVs. For data that fits in RAM, data.table is excellent. But it has the same fundamental limit as dplyr: the data must fit in memory. With a 94 GB dataset and 16 GB of RAM, even fread() cannot help.
This sets up the core argument: speed alone is not enough. We need a strategy that never requires the whole dataset to be in memory at once.
4.3 The recommended workflow
Dr. Higgins lays out a four-step workflow that I’ll walk through with code:
4.3.1 Step 1: Open the dataset with arrow::open_dataset()
Instead of reading the file, you open a connection to it. Arrow scans the schema and stops there.
library(arrow)
library(dplyr)
library(duckdb)
# Point Arrow at the folder containing all 9 years of CSV files
payments <- open_dataset(
sources = "data/cms_open_payments/",
format = "csv"
)Important detail Dr. Higgins flagged: When opening a folder of multiple CSV files, Arrow expects them all to have the same schema (same column names, same column types, same column order). If even one file has a slightly different schema (for example, a column that’s
int32in one year andint64in another), Arrow will throw an error. He recommends opening individual files first to check, then building a unified schema explicitly usingschema()if needed.
4.3.2 Step 2: Wrangle to subset and reduce the data using dplyr verbs
This is where the magic happens. You write ordinary dplyr code, and Arrow translates it into efficient operations that run on disk, reading only the chunks of data it actually needs.
# Build a query that reduces 94 GB of payments to a manageable summary
top_recipients <- payments |>
filter(Total_Amount_of_Payment_USDollars > 1000) |>
group_by(Physician_First_Name, Physician_Last_Name, Recipient_State) |>
summarize(
total_received = sum(Total_Amount_of_Payment_USDollars, na.rm = TRUE),
n_payments = n()
) |>
arrange(desc(total_received))Key insight: Nothing has actually executed yet. Arrow has built a query plan, but no data has been read into memory. The
top_recipientsobject is just a description of work to be done, like a recipe rather than the finished dish.
4.3.3 Step 3: collect() the reduced result into R
Once the query has been wrangled down to something that fits in RAM, pull it into R for the rest of your analysis:
# Execute the query and pull the (now small) result into memory
top_recipients_df <- top_recipients |>
collect()
# Now we have an ordinary tibble we can plot, model, or summarize further
nrow(top_recipients_df)The general pattern Dr. Higgins emphasized: Filter and aggregate big, then collect small. The big dataset stays on disk; only the summarized result comes into R.
4.3.4 Step 4: Use DuckDB if further wrangling exceeds Arrow’s capabilities
This is the step that makes Step 3’s workflow most distinct from the basic Arrow tutorial. Sometimes you’ve reduced the data using Arrow, but the next step (maybe a more complex SQL window function, or a specific join type) isn’t yet supported by Arrow’s Acero query engine. In that case, you hand the data off to DuckDB mid-pipeline using to_duckdb():
# Start in Arrow (reading from disk), then switch to DuckDB for the join
payments |>
filter(Total_Amount_of_Payment_USDollars > 1000) |>
to_duckdb() |>
# Now we can use DuckDB-specific SQL features
left_join(physician_specialties, by = "Physician_Profile_ID") |>
group_by(Specialty) |>
summarize(total = sum(Total_Amount_of_Payment_USDollars)) |>
collect()Why this works without copying: Arrow and DuckDB share the same in-memory columnar layout.
to_duckdb()doesn’t serialize the data into a new format. It just hands DuckDB a pointer to the Arrow memory, which is what makes the bridge essentially free in terms of performance.
4.4 A reusable template for future projects
Putting all four steps together, here’s the pattern in compact form, ready to adapt for any larger-than-RAM dataset:
library(arrow)
library(dplyr)
library(duckdb)
# 1. Open the dataset (no data loaded yet)
ds <- open_dataset(
sources = "path/to/data/folder/",
format = "csv" # or "parquet"
)
# 2. Wrangle lazily with dplyr verbs
result <- ds |>
filter(some_column > threshold) |>
group_by(grouping_var) |>
summarize(metric = sum(value_col, na.rm = TRUE))
# 3a. If the result fits in RAM, collect to a tibble
final <- result |> collect()
# 3b. If further wrangling needs DuckDB features, hand it off first
final <- result |>
to_duckdb() |>
# ... DuckDB-specific operations ...
collect()5 References
- Richardson, N. (useR! Conference). Scaling Up Data Workflows with Arrow, Parquet, and DuckDB.
- Crane, N. & Hazlitt, S. (R-Ladies Ottawa). Introduction to Arrow.
- Higgins, P. (R/Medicine Conference 2022). Using the
{arrow}and{duckdb}packages to wrangle medical datasets that are larger than RAM. - Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. R for Data Science (2e), Chapter 22: Arrow. https://r4ds.hadley.nz/arrow.html
- Richardson, N., Crane, N., & Keane, J. Scaling Up with R and Arrow. https://arrowrbook.com/