2023-06-07

Intro to parquet

  • A cross-platform (R, Python, etc.) binary file format.
  • Supports modern compression such as Zstandard.
  • Stores tabular data (e.g. tibbles/data frames) by column.
  • The fastest file type we tested for reading and writing over the PHS PWB connection.
  • Low memory usage by default, with optional features to reduce it further.

Writing a parquet file

  • To write a parquet file we use the {arrow} package.
  • write_parquet(data, path) - uses the default ‘snappy’ compression.
  • write_parquet(data, path, compression = "zstd") - uses ‘Zstandard’ for greater compression.

Using more compression takes up less space on disk but can be (slightly) slower to write; read times, however, are mostly unaffected by the compression level.
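
A minimal sketch of both options (my_data and the output path are placeholder names):

library(arrow)

# Default 'snappy' compression - fast to write, good compression
write_parquet(my_data, "my_data.parquet")

# 'Zstandard' compression - smaller files, possibly slightly slower to write
write_parquet(my_data, "my_data.parquet", compression = "zstd")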

Reading a parquet file

  • To read we use read_parquet(file), also from the {arrow} package.
library(arrow)

# Path to the Scottish Postcode Directory (SPD) parquet file on Stats
spd_path <- fs::path(
  "/conf/linkage/output/lookups/Unicode/Geography",
  "Scottish Postcode Directory",
  "Scottish_Postcode_Directory_2023_1.parquet"
)

# Returns the data as a tibble by default
read_parquet(file = spd_path)

Benchmarking - Writing to Stats

Writing an extract of 1 million rows from the SMR01 dataset to the Stats server. Times presented below are in seconds.

Package   File Format   Compression           Minimum Time (s)   Median Time (s)
{arrow}   parquet       Zstandard             39.57              39.77
{arrow}   parquet       Snappy                46.53              46.59
{readr}   csv           Uncompressed          174                174
{base}    rds           Default Compression   222.6              227.4
{fst}     fst           Default Compression   978                980.4
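
For illustration, a write benchmark along these lines could be run with {bench} (the extract object and output paths here are assumed placeholders, not the exact code used):

bench::mark(
  parquet_zstd   = arrow::write_parquet(smr01_extract, "smr01.parquet", compression = "zstd"),
  parquet_snappy = arrow::write_parquet(smr01_extract, "smr01.parquet"),
  csv            = readr::write_csv(smr01_extract, "smr01.csv"),
  rds            = saveRDS(smr01_extract, "smr01.rds"),
  fst            = fst::write_fst(smr01_extract, "smr01.fst"),
  check = FALSE # the write functions return different values
)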

Benchmarking - Reading from Stats

Reading an extract of the SMR01 dataset containing 1 million rows, then aggregating to count the number of episodes by location. Times presented below are in seconds.

Package   File Format   Compression           Minimum Time (s)   Median Time (s)
{fst}     fst           Default Compression   0.34799            0.35548
{arrow}   parquet       Zstandard             0.37768            0.38025
{arrow}   parquet       Snappy                0.38535            0.3921
{readr}   csv           Uncompressed          4.06               4.13
{base}    rds           Default Compression   17.43              17.52
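
For illustration, the parquet read-and-aggregate step might have looked like this (smr01_path and the location column are assumed names):

library(dplyr) # provides %>% and count()

# Read the 1 million row extract and count episodes by location
arrow::read_parquet(smr01_path) %>%
  count(location)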

Ex 1 - Reading the SPD

The SPD exists on Stats in a few different formats; we can read each one and compare them using the {bench} package.

# Compare reading the SPD saved in each format (same path, different extension)
bench::mark(
  rds = readr::read_rds(fs::path_ext_set(spd_path, "rds")),
  csv = readr::read_csv(fs::path_ext_set(spd_path, "csv")),
  csv_zip = readr::read_csv(fs::path_ext_set(spd_path, "zip")),
  parquet = read_parquet(file = spd_path),
  check = FALSE # column types and attributes can differ between formats
)

## # A tibble: 4 × 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 rds           2.39s    2.39s     0.419   137.2MB    0.419
## 2 csv           4.39s    4.39s     0.228   162.1MB    0.455
## 3 csv_zip       4.11s    4.11s     0.244  297.28MB    0.731
## 4 parquet    592.52ms 592.52ms     1.69     9.55MB    0

Ex 2 - Reading a file and performing an aggregate

The speed-up and memory savings are maintained when doing analysis. This example reads the SPD and then counts postcodes per HSCP name.

library(dplyr) # attached for the %>% pipe

bench::mark(
  rds = readr::read_rds(fs::path_ext_set(spd_path, "rds")) %>% 
    dplyr::count(hscp2019name),
  csv = readr::read_csv(fs::path_ext_set(spd_path, "csv")) %>% 
    dplyr::count(hscp2019name),
  csv_zip = readr::read_csv(fs::path_ext_set(spd_path, "zip")) %>% 
    dplyr::count(hscp2019name),
  parquet = read_parquet(file = spd_path) %>% 
    dplyr::count(hscp2019name),
  check = FALSE # column types can differ between formats
)

## # A tibble: 4 × 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 rds           2.55s    2.55s     0.392   139.6MB    0.783
## 2 csv           4.37s    4.37s     0.229   157.5MB    0.458
## 3 csv_zip       4.15s    4.15s     0.241   301.3MB    0.723
## 4 parquet    631.82ms 631.82ms     1.58     10.6MB    0

Bonus features - Further reading

The {arrow} package provides some extra features, not necessarily exclusive to the parquet format. These ‘bonus features’ can improve speed and reduce memory usage even further, especially when dealing with large data files!

  • You can read only specific columns: read_parquet(file, col_select = c(var1, var2)). See the read_parquet() reference.
  • You can read and run very efficient analyses using arrow’s ‘Arrow Table’ feature: read_parquet(file, as_data_frame = FALSE) %>% ... %>% collect(). See the data wrangling article. Both of these features are sketched below.
  • You can use arrow’s ‘Dataset’ feature to store and query files with obvious groupings efficiently. See the Dataset article.
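
A minimal sketch of the first two features, reusing spd_path from earlier (the selected column names are assumptions for illustration):

library(arrow)
library(dplyr)

# Read only the columns you need - far less memory for wide files
spd_cols <- read_parquet(spd_path, col_select = c(pc7, hscp2019name))

# Keep the data as an Arrow Table: the count runs in arrow's engine
# and only the small result is pulled into R when collect() is called
read_parquet(spd_path, as_data_frame = FALSE) %>%
  count(hscp2019name) %>%
  collect()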