dbverse:
composable database libraries for larger-than-memory scientific analytics

Edward C. Ruiz

ecruiz@bu.edu

Ph.D. Candidate, Dries Lab, Boston University

August 15, 2024

Motivation

Current challenges with scientific data analysis

Scientific data is often messy and complex.
- Not your standard dataframe
- Heterogeneous, multi-modal (e.g. “multi-omics”)
- Larger-than-memory (e.g. spatial multi-omics)
A variety of tools and languages are used to analyze scientific data.
- Interoperability is often limited
- Fragmentation by data type
How can we develop better approaches for scientific data analysis?

dbverse overview

`{dbverse}` adopts familiar syntax

Example with `{dbMatrix}`

Matrix

rownames(dgc)[1:5]

[1] "Gna12"  "Ccnd2"  "Btbd17" "Sox9"   "Sez6"

colnames(dgc)[1:5]

[1] "AAAGGGATGTAGCAAG-1" "AAATGGCATGTCTTGT-1" "AAATGGTCAATGTGCC-1"
[4] "AAATTAACGGGTAGCT-1" "AACAACTGGTAGTTGC-1"

dim(dgc)

[1] 634 624

dbMatrix

Loaded dbMatrix

rownames(dbMatrix)[1:5]

[1] "Gna12"  "Ccnd2"  "Btbd17" "Sox9"   "Sez6"

colnames(dbMatrix)[1:5]

[1] "AAAGGGATGTAGCAAG-1" "AAATGGCATGTCTTGT-1" "AAATGGTCAATGTGCC-1"
[4] "AAATTAACGGGTAGCT-1" "AACAACTGGTAGTTGC-1"

dim(dbMatrix)

[1] 634 624

How does it work? dbMatrix example

dbMatrix adopts familiar {Matrix} syntax…

scaled <- dbMatrix[,"cell_1"] * 10

with underlying methods implemented with {dplyr} …

scaled <- dplyr::tbl(con, "dbMatrix") |>
  dplyr::filter(cell_id == "cell_1") |>   # keep only the rows for cell_1
  dplyr::mutate(scaled = expression * 10) # add the scaled expression column

which are transpiled to SQL via {dbplyr}…

SELECT cell_id, expression * 10 AS scaled
FROM my_cells.db
WHERE cell_id = 'cell_1';

and lazily evaluated in a DuckDB database 🐥🚀!
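
The chain above can be tried end to end on a toy table. The following is a minimal sketch, assuming the {DBI}, {duckdb}, {dplyr}, and {dbplyr} packages are installed. The table name `expr_long` and its columns (`gene`, `cell_id`, `expression`) are illustrative assumptions, not the actual {dbMatrix} storage layout.

```r
library(DBI)
library(dplyr)

# In-memory DuckDB database (nothing is written to disk)
con <- dbConnect(duckdb::duckdb())

# Hypothetical long-format expression table: one row per (gene, cell) entry
expr_long <- data.frame(
  gene       = c("Gna12", "Ccnd2", "Gna12"),
  cell_id    = c("cell_1", "cell_1", "cell_2"),
  expression = c(1, 3, 5)
)
dbWriteTable(con, "expr_long", expr_long)

# Lazy query: nothing is computed until collect() is called
scaled_tbl <- tbl(con, "expr_long") |>
  filter(cell_id == "cell_1") |>
  mutate(scaled = expression * 10)

# Print the SQL that {dbplyr} generates for this pipeline
show_query(scaled_tbl)

# Execute the query in DuckDB and pull the result into R
scaled <- collect(scaled_tbl)

dbDisconnect(con, shutdown = TRUE)
```

`show_query()` only prints the generated SQL; the query runs in DuckDB when `collect()` is called.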

Illustrative `{dbMatrix}` benchmark

`{dbMatrix}` performs larger-than-memory sparse matrix operations and outperforms HDF5Matrix

norm_mat <- t(t(dbMatrix) / libsizes) * scalefactor
lib_norm_mat <- log(norm_mat + offset) / log(base)
log_norm_mat <- t(norm_mat) - colMeans(lib_norm_mat)
# ...additional matrix operations
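
To make the first two steps concrete, here is a small in-memory sketch in base R. A dense toy matrix stands in for the dbMatrix object, and it is assumed that `libsizes` holds per-cell column sums; the values of `scalefactor`, `offset`, and `base` are illustrative, not the benchmark settings.

```r
# Toy counts matrix: 3 genes (rows) x 2 cells (columns); values are made up
counts <- matrix(
  c(1, 3, 6, 0, 5, 5),
  nrow = 3,
  dimnames = list(c("g1", "g2", "g3"), c("cell_1", "cell_2"))
)

libsizes    <- colSums(counts)  # assumed: total counts per cell
scalefactor <- 6000             # illustrative value
offset      <- 1                # illustrative value
base        <- 2                # illustrative value

# Divide each column by its library size, then rescale
norm_mat <- t(t(counts) / libsizes) * scalefactor

# Log-transform; the offset keeps zero counts finite
log_mat <- log(norm_mat + offset) / log(base)
```

The double transpose is needed because R recycles `libsizes` down the rows: transposing first lines each cell up with its own library size.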

Illustrative `{dbSpatial}` benchmark

`{dbSpatial}` outperforms existing in-memory methods for spatial intersections

Task: find the cell polygons that intersect tissue regions of interest (ROIs)

Median Runtime (seconds; 5X queries)
	No. Polygons	dbSpatial	sf (memory)	Δ Performance
ROI 1	1564	0.05	0.480	9X
ROI 2	92498	1.56	39.024	25X
ROI 3	143245	2.55	65.398	25X
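
For reference, the in-memory side of this task can be sketched with {sf} on toy data. This illustrates the kind of intersection query being timed; it is not the benchmark code and does not show the {dbSpatial} API. The `square()` helper and all coordinates are hypothetical.

```r
library(sf)

# Hypothetical helper: axis-aligned square polygon from its lower-left corner
square <- function(x, y, size) {
  st_polygon(list(rbind(
    c(x, y), c(x + size, y), c(x + size, y + size), c(x, y + size), c(x, y)
  )))
}

# Three toy "cell" polygons and one region of interest
cells <- st_sfc(square(0, 0, 1), square(2, 2, 1), square(10, 10, 1))
roi   <- st_sfc(square(0.5, 0.5, 2))

# TRUE where a cell polygon intersects the ROI
hits <- st_intersects(cells, roi, sparse = FALSE)[, 1]

# Keep only the cells that fall in or overlap the ROI
cells_in_roi <- cells[hits]
```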

Illustrative `{dbSequence}` benchmark

`{dbSequence}` outperforms competing methods

Task: filter reads in a genomic *.bam file (28GB, 285e6 reads)

samtools v1.20 (Li et al. 2009)
q01: chromosome region
q02: q01 + flag
q03: q02 + CIGAR string (samtools + awk)

Median Runtime (seconds; 5X queries)
Query	dbSequence	samtools	Δ Speed
q01	0.03400	0.08	2X
q02	0.00622	0.02	3X
q03	19.26000	DNC	∞
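
The three query types can be sketched as plain SQL against a toy DuckDB table. This is an assumption-laden illustration, not the {dbSequence} API: the `reads` table, its columns (named after SAM fields), and the filter values are all hypothetical, and the region filter is simplified to the read start position.

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())

# Hypothetical table of aligned reads; columns named after SAM fields
reads <- data.frame(
  rname = c("chr1", "chr1", "chr1", "chr2"),
  pos   = c(150L, 180L, 900L, 150L),
  flag  = c(0L, 16L, 16L, 16L),
  cigar = c("50M", "10S40M", "50M", "50M"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "reads", reads)

# q01: reads starting inside a chromosome region
q01 <- "SELECT * FROM reads WHERE rname = 'chr1' AND pos BETWEEN 100 AND 200"
# q02: q01 plus a SAM flag bit test (16 = read on the reverse strand)
q02 <- paste(q01, "AND (flag & 16) = 16")
# q03: q02 plus a CIGAR pattern (here: contains a soft clip, 'S')
q03 <- paste(q02, "AND cigar LIKE '%S%'")

n01 <- nrow(dbGetQuery(con, q01))
n02 <- nrow(dbGetQuery(con, q02))
n03 <- nrow(dbGetQuery(con, q03))

dbDisconnect(con, shutdown = TRUE)
```

Each query only adds a predicate to the previous one, so the database handles the combined filter in a single statement rather than piping output between tools.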

Conclusions

Advantages of using DuckDB for scientific data analysis

Runs on modern laptops: All previous benchmarks were performed on a MacBook Pro M2, 16GB RAM, 512GB SSD
Open Source: MIT license
Platform-independent: Runs on all major OS
Portable: Share results in a single *.db file
Affordable: Free to use, pay for more local storage as needed or ‘hybrid execution’ with MotherDuck
First release: 08/15/2024 (today 🎉)

Limitations and future directions

dbverse is currently only compatible with R
- Plan to support other languages (e.g. Python)
Limited visualization/plotting functionality
- uwdata/mosaic integration, see discussion #354
Limited support for large images
- DuckDB Spatial Extension Raster support
Plans to adopt duckplyr
- See duckdblabs/duckplyr issue #86
… and much more!

Acknowledgements

Ruben Dries Lab

Jiaji George Chen
Iqra Amin
Wonyl Choi
Junxiang Xu
Yibing Michelle Wei
Jeffrey Sheridan
Quynh Sun
Veronica Jarzabek

Funding

Questions?

To learn more please visit:
https://drieslab.github.io/dbverse/

🐦👨🏽‍💻@Ed2uiz | ✉️ ecruiz@bu.edu
