Handling Big Data with Arrow

R version 4.0.5

Introduction

Big Data Problems in R:

R does all its work in memory, which is limited by the machine's RAM
Overloading R's memory can lead to frequent crashes
Reading, writing, and querying large datasets is time-consuming
Large data outputs take up server space

Big Data Solutions in R: Apache 'Arrow'

Apache Arrow is a cross-language development platform for in-memory and larger-than-memory data.

Key features:

- Reading and writing data with Arrow Parquet files

- Analyzing in-memory and larger-than-memory datasets using Arrow tables

library(arrow)

Reading and Writing Data: Arrow Parquet Files

What are Parquet files?

Parquet is a columnar, binary storage format designed for efficient storage and fast retrieval.

Advantages

Saves data in columnar format, which is more efficient than row-based files like CSV
Can be compressed, requiring minimal storage on the server
Can be read and written by other programs (e.g. Python)

Disadvantages

Not human readable (stored in binary)
Minimal/no benefit to using this format when working with small datasets
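
A minimal round-trip sketch of the points above, assuming the arrow package is installed. The small data frame and the temporary file are illustrative only and are not the benchmark data used in these slides.

```r
library(arrow)

# Small illustrative data frame (hypothetical values)
df <- data.frame(
  id  = 1:5,
  age = c(62, 87, 40, 88, 55),
  sex = c(0L, 0L, 1L, 0L, 1L)
)

# Write to a temporary Parquet file, then read it back
path <- tempfile(fileext = ".parquet")
write_parquet(df, path)
df_back <- as.data.frame(read_parquet(path))

# Parquet stores a schema with the data, so column types come back unchanged
str(df_back)
```

Opening the file in a text editor would show binary content, which is the "not human readable" trade-off listed above.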

Dummy Data for these exercises

Dimensions:

	Columns	Rows
Dummy df	31	10,000,000

Structure:

id	outcome	age	sex	surv_ha	surv_date	dx_1	dx_2	dx_4	dx_5	dx_9
1	1	62	0	4	2020-10-24	0	0	0	1	0
2	1	87	0	5	2016-06-10	0	0	0	0	1
3	1	40	1	3	2019-08-08	0	0	1	0	0
4	1	88	0	3	2015-11-01	1	1	0	0	0

Reading Data: Comparing Parquet to other methods

Example Code:

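#Packages used by the examples
library(readr)       # read_csv(), write_csv()
library(data.table)  # fread(), fwrite()
library(here)        # here()
library(arrow)       # read_parquet(), write_parquet()

#Tidyverse CSV File
read_csv(here("a_data", "dummy_df.csv"))

#Data Table CSV File
fread(here("a_data", "dummy_df.csv"))

#RDS File
readRDS(here("a_data", "dummy_df.RDS"))

#Parquet File
read_parquet(here("a_data", "dummy_df.parquet"))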

Mean Read Time:

File Type	Seconds
Tidy CSV	13.796720
DT CSV	10.459450
RDS	10.116643
Parquet	3.958514
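
The slides do not say how the mean times were measured. One possible approach is sketched below using base R's system.time(). The helper mean_seconds() is hypothetical, and the data is a small stand-in, so absolute timings will not match the table above.

```r
library(arrow)

# Hypothetical stand-in data; far smaller than the 10,000,000-row benchmark
set.seed(1)
n <- 1e5
df <- data.frame(id = seq_len(n), age = sample(18:90, n, replace = TRUE))

csv_path     <- tempfile(fileext = ".csv")
parquet_path <- tempfile(fileext = ".parquet")
write.csv(df, csv_path, row.names = FALSE)
write_parquet(df, parquet_path)

# Mean elapsed seconds over `times` runs of a zero-argument function
mean_seconds <- function(f, times = 5) {
  mean(replicate(times, system.time(f())[["elapsed"]]))
}

read_times <- c(
  csv     = mean_seconds(function() read.csv(csv_path)),
  parquet = mean_seconds(function() read_parquet(parquet_path))
)
read_times
```

Base read.csv() is used here only to keep the sketch free of extra packages. Swapping in read_csv() or fread() would follow the same pattern.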

Writing Data: Comparing Parquet to other file types

Example Code:

#Tidyverse CSV file
write_csv(df, here("a_data", "dummy_df.csv"))

#Data Table CSV file
fwrite(df, here("a_data", "dummy_df.csv"))

#RDS File
saveRDS(df, here("a_data", "dummy_df.RDS"))

#Parquet File
write_parquet(df, here("a_data", "dummy_df.parquet"))

#Compressed Parquet File
write_parquet(df, here("a_data", "dummy_df_comp.parquet"),
              compression = "gzip")

Mean Write Time:

File Type	Seconds
Tidy CSV	69.905593
DT CSV	3.558130
RDS	38.063262
Parquet	4.278639
Parquet compressed	15.035137

File Size:

Path	File Size (MB)
dummy_df.csv	742.2
dummy_df.RDS	90.7
dummy_df.parquet	95.1
dummy_df_comp.parquet	59.9
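
A sketch of how file sizes like these could be compared, assuming the arrow package is installed. The helper size_mb() and the stand-in data are hypothetical. Codec availability depends on how arrow was built, so the gzip step is guarded.

```r
library(arrow)

# Hypothetical stand-in data
set.seed(1)
n <- 1e5
df <- data.frame(id = seq_len(n), dx_1 = rbinom(n, 1, 0.1))

# File size in megabytes
size_mb <- function(path) file.size(path) / 1024^2

plain_path <- tempfile(fileext = ".parquet")
write_parquet(df, plain_path, compression = "uncompressed")
sizes <- c(uncompressed = size_mb(plain_path))

# gzip support depends on the arrow build, so check before using it
if (codec_is_available("gzip")) {
  gzip_path <- tempfile(fileext = ".parquet")
  write_parquet(df, gzip_path, compression = "gzip")
  sizes <- c(sizes, gzip = size_mb(gzip_path))
}
sizes
```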

Analyzing Data: Arrow Tables

What are Arrow tables?

An Arrow table is a two-dimensional dataset with chunked arrays for columns, together with a schema providing field names.

Advantages

Requires virtually no R memory to have in the environment: the R object is only a reference, and the data is held in memory managed by Arrow
Parquet files can be read into R as an Arrow table
Can be turned into a data frame or data table at any time
Any data frame or data table can be converted to an Arrow table at any time
Can query these tables prior to converting to a data frame or data table

Disadvantages

Cannot view the table's contents without pulling the data into R memory (e.g. with collect())
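Querying is limited to operations that have an equivalent Arrow compute kernel; what is supported depends on the installed arrow version (e.g. > and < were not available in the version used here)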
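
A small sketch of working with an Arrow table, assuming the arrow and dplyr packages are installed. The data values are hypothetical. It shows that the schema and dimensions can be inspected directly, while looking at rows means collecting them into R.

```r
library(arrow)
library(dplyr)

# Hypothetical example data
df <- data.frame(
  id      = 1:6,
  age     = c(62, 87, 40, 88, 55, 71),
  surv_ha = c(4L, 5L, 3L, 3L, 1L, 2L)
)

tbl <- arrow_table(df)

# Metadata is available without converting anything to an R data frame
tbl$schema
dim(tbl)

# Pull only the first rows into R to look at them
preview <- tbl %>%
  head(3) %>%
  collect()
preview
```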

R Memory: Comparing Arrow table to data frame

Example Code:

#Read in Parquet file as Arrow table
df_arrow <- read_parquet(here("a_data", "dummy_df.parquet"),
                         as_data_frame = FALSE)

#Convert to data frame
df <- df_arrow %>%
  collect()

#Convert back to Arrow table
df_arrow <- arrow_table(df)

R Memory Allocation:

Data Type	Object Size
Data frame	1182.6 Mb
Arrow table	0 Mb
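
A sketch of how such a comparison could be made with base R's object.size(), assuming the arrow package is installed and using hypothetical stand-in data. Note that object.size() measures only the R-side object. For an Arrow table that is a small handle, and the column data still occupies memory managed by Arrow.

```r
library(arrow)

# Hypothetical stand-in data
set.seed(1)
n <- 1e5
df <- data.frame(id = seq_len(n), age = runif(n, 18, 90))
df_arrow <- arrow_table(df)

# Size of each object as seen by R
format(object.size(df), units = "Mb")
format(object.size(df_arrow), units = "Mb")
```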

Querying: Comparing tidy, data table, and arrow

Query:

Group by HA (surv_ha)
Calculate average age

#Tidyverse Method
df %>%
  group_by(surv_ha) %>%
  summarise(mean_age = mean(age)) %>%
  ungroup()

#Data Table Method                                    
df_dt[,
      .(mean_age = mean(age)),
      by = surv_ha]
                       
#Arrow and Tidyverse Method
#(recent arrow versions build the query lazily; add collect() to return the result to R)
df_arrow %>%
  group_by(surv_ha) %>%
  summarise(mean_age = mean(age))

Mean Query Time:

Method	Seconds
Tidy Table	0.1322553
Data Table	0.2089788
Arrow Table	0.1090925
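
A small runnable version of this query, assuming arrow and dplyr are installed and that the installed arrow version supports grouped summaries. The data values are hypothetical. It checks that the Arrow result matches the plain dplyr result once collected.

```r
library(arrow)
library(dplyr)

# Hypothetical example data
df <- data.frame(
  surv_ha = c(1L, 1L, 2L, 2L, 3L),
  age     = c(60, 80, 40, 50, 70)
)
df_arrow <- arrow_table(df)

# Arrow: the query runs when collect() is called
mean_age_arrow <- df_arrow %>%
  group_by(surv_ha) %>%
  summarise(mean_age = mean(age)) %>%
  collect() %>%
  arrange(surv_ha)

# Plain dplyr on the data frame, for comparison
mean_age_df <- df %>%
  group_by(surv_ha) %>%
  summarise(mean_age = mean(age)) %>%
  arrange(surv_ha)

all.equal(as.data.frame(mean_age_arrow), as.data.frame(mean_age_df))
```

The arrange() step is there because the row order of a collected Arrow result is not guaranteed.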

Querying: Comparing tidy, data table, and arrow

Query:

Left join on HA crosswalk file

#Tidyverse Method
df %>%
  left_join(ha_xwalk, by = "surv_ha")

#Data Table Method
#(X[Y] keeps every row of Y, so df_dt goes inside the brackets for a left join)
setDT(ha_xwalk)[
  df_dt, on = .(surv_ha)]

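#Arrow and Tidyverse Method
#(recent arrow versions build the query lazily; add collect() to return the result to R)
df_arrow %>%
  left_join(ha_xwalk, by = "surv_ha")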

Mean Query Time:

Method	Seconds
Tidy Table	1.0713986
Data Table	1.0738668
Arrow Table	0.0869529
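
A small runnable version of the join, assuming arrow and dplyr are installed and that the installed arrow version supports joins. The crosswalk contents (ha_name values) are hypothetical. The last row has no match, so it should come back with a missing ha_name.

```r
library(arrow)
library(dplyr)

# Hypothetical example data and crosswalk
df <- data.frame(id = 1:4, surv_ha = c(1L, 2L, 2L, 9L))
ha_xwalk <- data.frame(
  surv_ha = c(1L, 2L, 3L),
  ha_name = c("North", "South", "East")
)

joined <- arrow_table(df) %>%
  left_join(arrow_table(ha_xwalk), by = "surv_ha") %>%
  collect() %>%
  arrange(id)
joined
```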

Querying: Comparing tidy, data table, and arrow

Query:

Read, filter, and select columns

#Tidyverse Method
read_csv(here("a_data", "dummy_df.csv")) %>%
  filter(surv_ha %in% c("1", "2", "3")) %>%
  select(1:6)

#Data Table Method                 
fread(here("a_data", "dummy_df.csv"), data.table = TRUE,
      select = c("id", "outcome", "age", "sex", "surv_ha", "surv_date"))[
  surv_ha %in% c("1", "2", "3")
  ]

#Arrow and Tidyverse Method                
read_parquet(here("a_data", "dummy_df_comp.parquet"),
             as_data_frame = FALSE)  %>%
  filter(surv_ha %in% c("1", "2", "3")) %>%
  select(1:6) %>%
  collect()

Mean Query Time:

Method	Seconds
Tidy CSV	80.689017
DT CSV	1.849376
Arrow	6.910698
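
The data.table example limits the columns at read time with select. A comparable sketch for Parquet is shown below, assuming arrow and dplyr are installed. It uses the col_select argument of read_parquet() so that only the needed columns are read, then filters before collecting. The data is hypothetical and this variant was not part of the timings above.

```r
library(arrow)
library(dplyr)

# Hypothetical example data written to a temporary Parquet file
df <- data.frame(
  id      = 1:6,
  age     = c(62, 87, 40, 88, 55, 71),
  surv_ha = c(4L, 5L, 3L, 3L, 1L, 2L),
  dx_1    = c(0L, 0L, 0L, 1L, 0L, 1L)
)
path <- tempfile(fileext = ".parquet")
write_parquet(df, path)

# Read only three columns as an Arrow table, filter, then pull into R
subset_df <- read_parquet(path,
                          col_select = c("id", "age", "surv_ha"),
                          as_data_frame = FALSE) %>%
  filter(surv_ha %in% c(1L, 2L, 3L)) %>%
  collect()
subset_df
```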

BONUS: Using Collapse to convert dataframes to tibbles, data tables, or matrices

library(collapse)

Common Method	Collapse Method
as_tibble	qTBL
setDT	qDT
as.matrix	qM

Mean Conversion Time:

Conversion Method	Seconds
as_tibble	0.0949033
qTBL	0.0000539
setDT	0.0263830
qDT	0.0185861
as.matrix	109.7313930
qM	1.8972162
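
A short sketch of the three collapse conversions, assuming the collapse package is installed. The data frame is hypothetical and far too small to show any speed difference.

```r
library(collapse)

# Hypothetical example data
df <- data.frame(id = 1:3, age = c(62, 87, 40))

tb <- qTBL(df)  # tibble
dt <- qDT(df)   # data.table
m  <- qM(df)    # matrix

class(tb)
class(dt)
dim(m)
```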

Key Takeaways

Parquet files are efficient and functional for big data
Parquet files can be directly read into R as Arrow tables
Arrow tables use almost no R memory; their data is held in memory managed by Arrow
Arrow tables can be queried before being called into R memory
Data frames can be converted to Arrow tables and vice versa
In some cases querying with Arrow is faster, particularly for complicated tasks like joins

Note for Central Analytics Platform Users:
- The Arrow table functions are only available on R versions >= 4.0.2
- To install Arrow on R versions < 4.0.2, install without compilation; if issues persist, install package 'Rcpp' v1.0.1

Resources

Arrow Information Page

Arrow Cheat Sheet

Collapse Cheat Sheet

Braeden's GitHub
