R version 4.0.5
Big Data Problems in R:
Big Data Solutions in R: Apache 'Arrow'
Apache Arrow is a cross-language development platform for in-memory and larger-than-memory data.
Key features:
- Reading and writing data with Arrow Parquet files
- Analyzing in-memory and larger-than-memory datasets using Arrow tables
library(arrow)
What are Parquet files?
Parquet is a columnar file format designed for space-efficient storage and fast reads.
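Because Parquet stores data column by column, a reader can pull only the columns it needs. A minimal sketch, assuming the arrow package is installed; the `toy` data frame and temporary file are hypothetical stand-ins for the dummy data set:

```r
library(arrow)

# Hypothetical toy data standing in for the dummy data set
toy <- data.frame(
  id = 1:5,
  age = c(62L, 87L, 40L, 88L, 51L),
  surv_ha = c(4L, 5L, 3L, 3L, 1L)
)

# Write the data to a temporary Parquet file
path <- tempfile(fileext = ".parquet")
write_parquet(toy, path)

# Read back only two of the three columns
two_cols <- read_parquet(path, col_select = c("id", "age"))
print(names(two_cols))
```

Selecting columns at read time is one reason Parquet reads can be quicker than CSV reads, which have to parse every field in every row.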
Advantages
Disadvantages
Dimensions:
Dataset | Columns | Rows |
---|---|---|
Dummy df | 31 | 10,000,000 |
Structure:
id | outcome | age | sex | surv_ha | surv_date | dx_1 | dx_2 | dx_3 | dx_4 | dx_5 | dx_6 | dx_7 | dx_8 | dx_9 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 62 | 0 | 4 | 2020-10-24 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 1 | 87 | 0 | 5 | 2016-06-10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 1 | 40 | 1 | 3 | 2019-08-08 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
4 | 1 | 88 | 0 | 3 | 2015-11-01 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Example Code:
# Packages providing read_csv(), fread() and here()
library(readr)
library(data.table)
library(here)
# Tidyverse CSV file
read_csv(here("a_data", "dummy_df.csv"))
# data.table CSV file
fread(here("a_data", "dummy_df.csv"))
# RDS file
readRDS(here("a_data", "dummy_df.RDS"))
# Parquet file
read_parquet(here("a_data", "dummy_df.parquet"))
File Type | Read Time (seconds) |
---|---|
Tidy CSV | 13.796720 |
DT CSV | 10.459450 |
RDS | 10.116643 |
Parquet | 3.958514 |
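The slides do not show how the timings were collected. One way to gather comparable numbers is `system.time()`; the sketch below uses base R readers on a small hypothetical data frame to stay dependency-free, and `time_seconds()` is a made-up helper, not part of any package:

```r
# Hypothetical helper: elapsed seconds for evaluating an expression once
time_seconds <- function(expr) {
  unname(system.time(expr)["elapsed"])
}

# Small toy data set written to temporary files
toy <- data.frame(id = 1:1000, age = sample(18:90, 1000, replace = TRUE))
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write.csv(toy, csv_path, row.names = FALSE)
saveRDS(toy, rds_path)

# Time each reader once and collect the results in a table
timings <- data.frame(
  file_type = c("CSV", "RDS"),
  seconds = c(
    time_seconds(read.csv(csv_path)),
    time_seconds(readRDS(rds_path))
  )
)
print(timings)
```

A single run is noisy, so repeating each read several times and reporting the median would give steadier figures.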
Example Code:
# Tidyverse CSV file
write_csv(df, here("a_data", "dummy_df.csv"))
# data.table CSV file
fwrite(df, here("a_data", "dummy_df.csv"))
# RDS file
saveRDS(df, here("a_data", "dummy_df.RDS"))
# Parquet file
write_parquet(df, here("a_data", "dummy_df.parquet"))
# Parquet file with gzip compression
write_parquet(df, here("a_data", "dummy_df_comp.parquet"),
              compression = "gzip")
File Type | Write Time (seconds) |
---|---|
Tidy CSV | 69.905593 |
DT CSV | 3.558130 |
RDS | 38.063262 |
Parquet | 4.278639 |
Parquet compressed | 15.035137 |
Path | File Size (MB) |
---|---|
dummy_df.csv | 742.2 |
dummy_df.RDS | 90.7 |
dummy_df.parquet | 95.1 |
dummy_df_comp.parquet | 59.9 |
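A size table like the one above can be produced with `file.size()`. This is a sketch on a small hypothetical data frame, assuming the arrow package with gzip support; it also assumes 1 MB = 1024^2 bytes, which may not match how the figures above were computed:

```r
library(arrow)

# Hypothetical toy data; the real dummy_df has 31 columns and 10,000,000 rows
toy <- data.frame(
  id = 1:10000,
  sex = rep(c(0L, 1L), 5000),
  surv_ha = rep(1:5, 2000)
)

# Write the same data in each format to a temporary directory
out_dir <- tempfile("size_demo")
dir.create(out_dir)
paths <- c(
  csv = file.path(out_dir, "toy.csv"),
  rds = file.path(out_dir, "toy.RDS"),
  parquet = file.path(out_dir, "toy.parquet"),
  parquet_gzip = file.path(out_dir, "toy_comp.parquet")
)
write.csv(toy, paths[["csv"]], row.names = FALSE)
saveRDS(toy, paths[["rds"]])
write_parquet(toy, paths[["parquet"]])
write_parquet(toy, paths[["parquet_gzip"]], compression = "gzip")

# Report the size of each file in megabytes
sizes <- data.frame(
  path = basename(paths),
  size_mb = round(file.size(paths) / 1024^2, 3)
)
print(sizes)
```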
What are Arrow tables?
An Arrow table is a two-dimensional dataset with chunked arrays for columns, together with a schema providing field names.
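The definition can be inspected directly on a small table. A minimal sketch, assuming a version of arrow that provides `arrow_table()`; the `toy` data is hypothetical:

```r
library(arrow)

# Build an Arrow table from a hypothetical toy data frame
toy <- data.frame(id = 1:3, age = c(62L, 87L, 40L))
tbl <- arrow_table(toy)

# The schema lists field names and Arrow data types
print(tbl$schema)

# Two dimensions: rows and columns
print(dim(tbl))

# Each column is stored as a chunked array
age_col <- tbl[["age"]]
print(class(age_col))
```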
Advantages
Disadvantages
Example Code:
# Cache knitr chunk output (only relevant when knitting this document)
knitr::opts_chunk$set(cache = TRUE)
# Read in Parquet file as Arrow table
df_arrow <- read_parquet(here("a_data", "dummy_df.parquet"),
                         as_data_frame = FALSE)
# Convert to data frame
df <- df_arrow %>%
  collect()
# Convert back to Arrow table
df_arrow <- arrow_table(df)
Data Type | Object Size |
---|---|
Data frame | 1182.6 Mb |
Arrow table | 0 Mb |
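The 0 Mb figure most likely arises because `object.size()` measures only the small R object that points at the Arrow table, not the column data Arrow holds outside R's own memory. A sketch of the comparison on hypothetical toy data:

```r
library(arrow)

# Hypothetical toy data: two numeric columns of 100,000 rows
toy <- data.frame(x = runif(1e5), y = runif(1e5))
tbl <- arrow_table(toy)

# Size as seen by R for the data frame and for the Arrow table wrapper
df_size <- object.size(toy)
tbl_size <- object.size(tbl)

print(format(df_size, units = "Mb"))
print(format(tbl_size, units = "Mb"))
```

The data still occupies memory in the Arrow table; it is simply not counted by `object.size()`.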
Query:
# Tidyverse method
df %>%
  group_by(surv_ha) %>%
  summarise(mean_age = mean(age)) %>%
  ungroup()
# data.table method
df_dt[,
      .(mean_age = mean(age)),
      by = surv_ha]
# Arrow and tidyverse method
# (in recent arrow versions this builds a lazy query; add collect() to return the result to R)
df_arrow %>%
  group_by(surv_ha) %>%
  summarise(mean_age = mean(age))
Method | Seconds |
---|---|
Tidy Table | 0.1322553 |
Data Table | 0.2089788 |
Arrow Table | 0.1090925 |
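In recent arrow versions, dplyr verbs on an Arrow table build a query that runs only when `collect()` is called, so part of the Arrow timing above may reflect deferred work. A minimal sketch on hypothetical toy data showing both steps:

```r
library(arrow)
library(dplyr)

# Hypothetical toy data with three surv_ha groups
toy <- data.frame(
  surv_ha = c(1L, 1L, 2L, 2L, 3L),
  age = c(60, 70, 40, 50, 80)
)

# Step 1: describe the computation; nothing is computed yet
query <- arrow_table(toy) %>%
  group_by(surv_ha) %>%
  summarise(mean_age = mean(age))
print(class(query))

# Step 2: run the query and bring the result back as a tibble
result <- query %>%
  collect() %>%
  arrange(surv_ha)
print(result)
```

For a like-for-like benchmark, the Arrow pipeline would need to end in `collect()` so that all three methods return a result in R.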
Query:
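# Tidyverse method
df %>%
  left_join(ha_xwalk, by = "surv_ha")
# data.table method
# (X[Y] returns rows for each row of Y, here ha_xwalk, so df_dt rows
# with no match are dropped, unlike a left join)
df_dt[
  setDT(ha_xwalk), on = .(surv_ha)]
# Arrow and tidyverse method
df_arrow %>%
  left_join(ha_xwalk, by = "surv_ha")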
Method | Seconds |
---|---|
Tidy Table | 1.0713986 |
Data Table | 1.0738668 |
Arrow Table | 0.0869529 |
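A left join keeps every row of the left table and fills unmatched lookups with `NA`. The sketch below shows this with Arrow on hypothetical toy data; the crosswalk column `ha_name` is an invented name, since the slides do not show the contents of `ha_xwalk`:

```r
library(arrow)
library(dplyr)

# Hypothetical toy data; surv_ha = 9 has no entry in the crosswalk
toy <- data.frame(id = 1:4, surv_ha = c(1L, 2L, 3L, 9L))
xwalk <- data.frame(
  surv_ha = c(1L, 2L, 3L),
  ha_name = c("North", "South", "East")
)

# Join two Arrow tables on an explicit key, then run the query
joined <- arrow_table(toy) %>%
  left_join(arrow_table(xwalk), by = "surv_ha") %>%
  collect() %>%
  arrange(id)
print(joined)
```

Sorting after `collect()` matters here because Arrow joins do not promise to preserve row order.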
Query:
# Tidyverse method
read_csv(here("a_data", "dummy_df.csv")) %>%
  filter(surv_ha %in% c("1", "2", "3")) %>%
  select(1:6)
# data.table method
fread(here("a_data", "dummy_df.csv"), data.table = TRUE,
      select = c("id", "outcome", "age", "sex", "surv_ha", "surv_date"))[
  surv_ha %in% c("1", "2", "3")
]
# Arrow and tidyverse method
read_parquet(here("a_data", "dummy_df_comp.parquet"),
             as_data_frame = FALSE) %>%
  filter(surv_ha %in% c("1", "2", "3")) %>%
  select(1:6) %>%
  collect()
Method | Seconds |
---|---|
Tidy CSV | 80.689017 |
DT CSV | 1.849376 |
Arrow | 6.910698 |
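The Arrow example above loads the whole file into an Arrow table before filtering. An alternative not used in the slides is `open_dataset()`, which scans the file lazily so the filter and column selection can be applied during the read. A sketch on hypothetical toy data, assuming arrow was built with dataset support:

```r
library(arrow)
library(dplyr)

# Hypothetical toy data written to a temporary Parquet file
toy <- data.frame(
  id = 1:6,
  surv_ha = c(1L, 2L, 3L, 4L, 5L, 1L),
  extra = letters[1:6]
)
path <- tempfile(fileext = ".parquet")
write_parquet(toy, path)

# Describe the scan, filter and projection, then run it with collect()
result <- open_dataset(path) %>%
  filter(surv_ha %in% c(1L, 2L, 3L)) %>%
  select(id, surv_ha) %>%
  collect()
print(result)
```

Whether this beats `read_parquet()` on the 10,000,000-row file would need to be measured; no timing for it is claimed here.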
library(collapse)
Common Method | Collapse Method |
---|---|
as_tibble | qTBL |
setDT | qDT |
as.matrix | qM |
Conversion Method | Seconds |
---|---|
as_tibble | 0.0949033 |
qTBL | 0.0000539 |
setDT | 0.0263830 |
qDT | 0.0185861 |
as.matrix | 109.7313930 |
qM | 1.8972162 |
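The collapse converters in the table can be tried on a small example. A minimal sketch, assuming the collapse package is installed; the `toy` data frame is hypothetical:

```r
library(collapse)

# Hypothetical toy data frame
toy <- data.frame(id = 1:3, age = c(62, 87, 40))

# Quick conversions to tibble, data.table and matrix
toy_tbl <- qTBL(toy)
toy_dt <- qDT(toy)
toy_mat <- qM(toy)

print(class(toy_tbl))
print(class(toy_dt))
print(dim(toy_mat))
```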
Note for Central Analytics Platform Users:
- The Arrow table functions are only available on R versions >= 4.0.2
- To install Arrow on R versions < 4.0.2, install without compilation; if issues persist, install package 'Rcpp' v1.0.1