Handling Big Data with Arrow

R version 4.0.5

Introduction

Big Data Problems in R:


Big Data Solutions in R: Apache 'Arrow'

Apache Arrow is a cross-language development platform for in-memory and larger-than-memory data.

Key features:

- Reading and writing data with Arrow Parquet files

- Analyzing in-memory and larger-than-memory datasets using Arrow tables

library(arrow)

Reading and Writing Data: Arrow Parquet Files

What are Parquet files?

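Parquet is a column-oriented storage format designed for space efficiency: values are stored column by column, which compresses well and lets a reader load only the columns it needs.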
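A minimal sketch of what column-by-column storage allows, assuming the arrow package is installed: write a small data frame to a Parquet file, then read back only two of its columns. The toy data frame and the temporary path are made up for illustration and are not the dummy data used in these slides.

```r
library(arrow)

# Toy data (hypothetical, not the 10-million-row dummy data)
toy <- data.frame(
  id  = 1:5,
  age = c(62, 87, 40, 88, 55),
  sex = c(0L, 0L, 1L, 0L, 1L)
)

# Write to a temporary Parquet file
path <- tempfile(fileext = ".parquet")
write_parquet(toy, path)

# Read back only the columns that are needed
subset_df <- read_parquet(path, col_select = c("id", "age"))
names(subset_df)
```

Selecting columns at read time means the unused columns never have to be loaded into R.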

Advantages

Disadvantages

Dummy Data for these exercises


Dimensions:

          Columns  Rows
Dummy df  31       10,000,000



Structure:

id  outcome  age  sex  surv_ha  surv_date   dx_1  dx_2  dx_3  dx_4  dx_5  dx_6  dx_7  dx_8  dx_9
1   1        62   0    4        2020-10-24  0     0     0     0     1     0     0     0     0
2   1        87   0    5        2016-06-10  0     0     0     0     0     0     0     0     1
3   1        40   1    3        2019-08-08  0     0     0     1     0     0     0     0     0
4   1        88   0    3        2015-11-01  1     1     0     0     0     0     0     0     0

Reading Data: Comparing Parquet to other methods

Example Code:

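#Packages used in these examples
library(readr)       #read_csv(), write_csv()
library(data.table)  #fread(), fwrite()
library(here)        #here()

#Tidyverse CSV File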
read_csv(here("a_data", "dummy_df.csv"))

#Data Table CSV File
fread(here("a_data", "dummy_df.csv"))

#RDS File
readRDS(here("a_data", "dummy_df.RDS"))

#Parquet File
read_parquet(here("a_data", "dummy_df.parquet"))

Mean Read Time:

File Type  Seconds
Tidy CSV   13.796720
DT CSV     10.459450
RDS        10.116643
Parquet    3.958514
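The slides do not show how the mean read times were measured. One possible approach, sketched here with base R's system.time(), is to repeat each read a few times and average the elapsed seconds. The helper mean_elapsed() and the small toy data are hypothetical, so the numbers it produces will not match the table.

```r
library(arrow)

# Hypothetical helper: mean elapsed seconds over `times` runs of a function
mean_elapsed <- function(f, times = 3) {
  mean(vapply(
    seq_len(times),
    function(i) system.time(f())[["elapsed"]],
    numeric(1)
  ))
}

# Small toy data set written once in each format
set.seed(1)
toy <- data.frame(id = 1:100000, age = sample(18:90, 100000, replace = TRUE))
rds_path     <- tempfile(fileext = ".RDS")
parquet_path <- tempfile(fileext = ".parquet")
saveRDS(toy, rds_path)
write_parquet(toy, parquet_path)

# Time each reader on the same data
read_times <- c(
  RDS     = mean_elapsed(function() readRDS(rds_path)),
  Parquet = mean_elapsed(function() read_parquet(parquet_path))
)
read_times
```

Dedicated benchmarking packages give more robust estimates; this sketch only illustrates the idea of averaging repeated runs.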

Writing Data: Comparing Parquet to other file types

Example Code:

#Tidyverse CSV file
write_csv(df, here("a_data", "dummy_df.csv"))

#Data Table CSV file
fwrite(df, here("a_data", "dummy_df.csv"))

#RDS File
saveRDS(df, here("a_data", "dummy_df.RDS"))

#Parquet File
write_parquet(df, here("a_data", "dummy_df.parquet"))

#Compressed Parquet File
write_parquet(df, here("a_data", "dummy_df_comp.parquet"),
              compression = "gzip")

Mean Write Time:

File Type           Seconds
Tidy CSV            69.905593
DT CSV              3.558130
RDS                 38.063262
Parquet             4.278639
Parquet compressed  15.035137


File Size:

Path                   File Size (MB)
dummy_df.csv           742.2
dummy_df.RDS           90.7
dummy_df.parquet       95.1
dummy_df_comp.parquet  59.9
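A comparison of file sizes like the one above can be reproduced on a small scale with file.size(). This is a sketch on made-up toy data, assuming the arrow package is installed. It falls back to an uncompressed file if the gzip codec is not available in the local Arrow build.

```r
library(arrow)

# Toy data (hypothetical)
set.seed(1)
toy <- data.frame(id = 1:100000, dx_1 = rbinom(100000, 1, 0.1))

default_path <- tempfile(fileext = ".parquet")
comp_path    <- tempfile(fileext = ".parquet")

# Use gzip when the local Arrow build supports it
codec <- if (codec_is_available("gzip")) "gzip" else "uncompressed"

write_parquet(toy, default_path)                    # default settings
write_parquet(toy, comp_path, compression = codec)  # explicit codec

# file.size() returns bytes; convert to megabytes
sizes_mb <- file.size(c(default_path, comp_path)) / 1024^2
names(sizes_mb) <- c("default", codec)
round(sizes_mb, 2)
```

As the write-time table suggests, stronger compression trades a smaller file for a slower write.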

Analyzing Data: Arrow Tables

What are Arrow tables?

An Arrow table is a two-dimensional dataset with chunked arrays for columns, together with a schema providing field names.
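The definition above can be inspected directly. A minimal sketch, assuming the arrow package is installed and using a made-up three-row data frame:

```r
library(arrow)

# Toy data (hypothetical)
toy <- data.frame(id = 1:3, age = c(62, 87, 40))
tbl <- arrow_table(toy)

tbl$schema           # field names and their Arrow types
tbl$num_rows         # number of rows
tbl$num_columns      # number of columns
class(tbl[["age"]])  # each column is a ChunkedArray
```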

Advantages

Disadvantages

R Memory: Comparing Arrow table to data frame

Example Code:

library(dplyr)  #%>% and collect()

#Read in Parquet file as Arrow table
df_arrow <- read_parquet(here("a_data", "dummy_df.parquet"),
                         as_data_frame = FALSE)

#Convert to data frame
df <- df_arrow %>%
  collect()

#Convert back to Arrow table
df_arrow <- arrow_table(df)

R Memory Allocation:

Data Type    Object Size
Data frame   1182.6 Mb
Arrow table  0 Mb
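A likely explanation for the near-zero figure: the Arrow table object in R is a thin reference to data held in memory managed by Arrow rather than by R, so object.size() only measures the reference. The sketch below shows the effect on made-up toy data, assuming the arrow package is installed. It does not mean the table occupies no RAM.

```r
library(arrow)

# Toy data (hypothetical)
toy <- data.frame(id = 1:100000, age = rep(c(62, 87, 40, 88), 25000))
tbl <- arrow_table(toy)

# Size as seen by R for each object
format(object.size(toy), units = "Mb")
format(object.size(tbl), units = "Mb")
```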

Querying: Comparing tidy, data table, and arrow (grouped summary)

Query:

#Tidyverse Method
df %>%
  group_by(surv_ha) %>%
  summarise(mean_age = mean(age)) %>%
  ungroup()

#Data Table Method                                    
df_dt[,
      .(mean_age = mean(age)),
      by = surv_ha]
                       
#Arrow and Tidyverse Method           
df_arrow %>%
  group_by(surv_ha) %>%
  summarise(mean_age = mean(age))

Mean Query Time:

Object Type  Seconds
Tidy Table   0.1322553
Data Table   0.2089788
Arrow Table  0.1090925
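When dplyr verbs are applied to an Arrow table, they build a query that is only evaluated when collect() is called. The sketch below, on made-up toy data and assuming the arrow and dplyr packages are installed, shows the two steps separately. Timing a pipeline without collect() may therefore measure query construction rather than the computation itself.

```r
library(arrow)
library(dplyr)

# Toy data (hypothetical)
toy <- data.frame(
  surv_ha = c(1L, 1L, 2L, 2L, 3L),
  age     = c(60, 70, 40, 50, 80)
)
tbl <- arrow_table(toy)

# Step 1: build the query; nothing is computed yet
query <- tbl %>%
  group_by(surv_ha) %>%
  summarise(mean_age = mean(age))
class(query)

# Step 2: run the query and bring the result into R
result <- query %>%
  collect() %>%
  arrange(surv_ha)
result
```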

Querying: Comparing tidy, data table, and arrow (join)

Query:

#Tidyverse Method
df %>%
  left_join(ha_xwalk, by = "surv_ha")

#Data Table Method (X[Y] keeps every row of Y, so df_dt goes inside the brackets)
setDT(ha_xwalk)[
  df_dt, on = .(surv_ha)]

#Arrow and Tidyverse Method
df_arrow %>%
  left_join(ha_xwalk, by = "surv_ha")

Mean Query Time:

Object Type  Seconds
Tidy Table   1.0713986
Data Table   1.0738668
Arrow Table  0.0869529
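A small, self-contained version of the Arrow join, assuming the arrow and dplyr packages are installed. The contents of ha_xwalk are not shown in the slides, so the crosswalk here (including the ha_name column) is hypothetical. The result is collected so that the join is actually executed.

```r
library(arrow)
library(dplyr)

# Toy data and a hypothetical crosswalk keyed on surv_ha
toy <- arrow_table(data.frame(
  id      = 1:4,
  surv_ha = c(4L, 5L, 3L, 3L)
))
ha_xwalk <- arrow_table(data.frame(
  surv_ha = 3:5,
  ha_name = c("HA three", "HA four", "HA five")
))

# Left join on the shared key, then bring the result into R
joined <- toy %>%
  left_join(ha_xwalk, by = "surv_ha") %>%
  collect() %>%
  arrange(id)
joined
```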

Querying: Comparing tidy, data table, and arrow (read, filter, and select)

Query:

#Tidyverse Method
read_csv(here("a_data", "dummy_df.csv")) %>%
  filter(surv_ha %in% c("1", "2", "3")) %>%
  select(1:6)

#Data Table Method                 
fread(here("a_data", "dummy_df.csv"), data.table = TRUE,
      select = c("id", "outcome", "age", "sex", "surv_ha", "surv_date"))[
  surv_ha %in% c("1", "2", "3")
  ]

#Arrow and Tidyverse Method                
read_parquet(here("a_data", "dummy_df_comp.parquet"),
             as_data_frame = FALSE)  %>%
  filter(surv_ha %in% c("1", "2", "3")) %>%
  select(1:6) %>%
  collect()

Mean Query Time:

Method    Seconds
Tidy CSV  80.689017
DT CSV    1.849376
Arrow     6.910698
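The Arrow example above reads the whole file into an Arrow table before filtering. A different entry point, open_dataset(), scans the files lazily so that the filter and column selection can be applied while reading. The sketch below uses that alternative on made-up toy data and a temporary directory; it is not the method timed in the table, and it assumes the arrow and dplyr packages are installed.

```r
library(arrow)
library(dplyr)

# Toy data (hypothetical) written to a temporary directory
toy <- data.frame(
  id      = 1:6,
  age     = c(62, 87, 40, 88, 55, 71),
  surv_ha = c(1L, 2L, 3L, 4L, 5L, 1L),
  dx_1    = c(0L, 0L, 0L, 1L, 0L, 1L)
)
data_dir <- tempfile()
dir.create(data_dir)
write_parquet(toy, file.path(data_dir, "toy.parquet"))

# Build the query against the files, then collect only the matching rows
result <- open_dataset(data_dir) %>%
  filter(surv_ha %in% c(1L, 2L, 3L)) %>%
  select(id, age, surv_ha) %>%
  collect() %>%
  arrange(id)
result
```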

BONUS: Using Collapse to convert dataframes to tibbles, data tables, or matrices

library(collapse)

Common Method  Collapse Method
as_tibble      qTBL
setDT          qDT
as.matrix      qM

Mean Conversion Time:

Conversion Method  Seconds
as_tibble          0.0949033
qTBL               0.0000539
setDT              0.0263830
qDT                0.0185861
as.matrix          109.7313930
qM                 1.8972162
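A minimal sketch of the three collapse conversions, assuming the collapse package is installed and using a made-up three-row data frame:

```r
library(collapse)

# Toy data (hypothetical)
toy <- data.frame(id = 1:3, age = c(62, 87, 40))

toy_tbl <- qTBL(toy)  # tibble
toy_dt  <- qDT(toy)   # data.table
toy_mat <- qM(toy)    # matrix

class(toy_tbl)
class(toy_dt)
dim(toy_mat)
```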

Key Takeaways









Note for Central Analytics Platform Users:
- The Arrow table functions are only available on R versions >= 4.0.2
- To install Arrow on R versions < 4.0.2, install without compilation; if issues persist, install package 'Rcpp' v1.0.1
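A quick way to check which case applies is to compare the running R version against the 4.0.2 threshold from the note. The variable name is hypothetical, and the commented install call is an assumption about one way to avoid compilation (requesting a pre-built binary, where one is available for the platform); it is not taken from the platform's own instructions.

```r
# TRUE when the running R version meets the 4.0.2 threshold from the note
arrow_tables_ok <- getRversion() >= "4.0.2"

if (!arrow_tables_ok) {
  message("R ", as.character(getRversion()),
          " is below 4.0.2; Arrow table functions may be unavailable.")
}

# Assumption: a pre-built binary avoids compiling from source
# install.packages("arrow", type = "binary")
```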

Resources

Arrow Information Page

Arrow Cheat Sheet

Collapse Cheat Sheet

Braeden's GitHub












