dbverse:
composable database libraries for larger-than-memory scientific analytics
Edward C. Ruiz
Ph.D. Candidate, Dries Lab, Boston University
August 15, 2024
Motivation
Current challenges with scientific data analysis
- Scientific data is often messy and complex.
- Not your standard dataframe
- Heterogeneous, multi-modal (e.g. “multi-omics”)
- Larger-than-memory (e.g. spatial multi-omics)
- A variety of tools and languages are used to analyze scientific data.
- Interoperability is often limited
- Fragmentation by data type
- How can we develop better approaches for scientific data analysis?
dbverse overview
{dbverse} adopts familiar syntax
Example with {dbMatrix}
Matrix
[1] "Gna12" "Ccnd2" "Btbd17" "Sox9" "Sez6"
[1] "AAAGGGATGTAGCAAG-1" "AAATGGCATGTCTTGT-1" "AAATGGTCAATGTGCC-1"
[4] "AAATTAACGGGTAGCT-1" "AACAACTGGTAGTTGC-1"
dbMatrix
[1] "Gna12" "Ccnd2" "Btbd17" "Sox9" "Sez6"
[1] "AAAGGGATGTAGCAAG-1" "AAATGGCATGTCTTGT-1" "AAATGGTCAATGTGCC-1"
[4] "AAATTAACGGGTAGCT-1" "AACAACTGGTAGTTGC-1"
How does it work? dbMatrix example
dbMatrix adopts familiar {Matrix} syntax…
scaled <- dbMatrix[,"cell_1"] * 10
with underlying methods implemented with {dplyr} …
scaled <- dplyr::tbl(con, "dbMatrix") |>
  dplyr::filter(cell_id == "cell_1") |>
  dplyr::transmute(cell_id, scaled = expression * 10)
which are transpiled to SQL via {dbplyr}…
SELECT cell_id, expression * 10 AS scaled
FROM my_cells.db WHERE cell_id = 'cell_1';
and lazily evaluated in a DuckDB database 🐥🚀!
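The {dplyr} to SQL to DuckDB chain above can be tried on a toy table. This is a minimal sketch, not the {dbMatrix} internals: the long-format layout (`gene_id`, `cell_id`, `expression`) and the values are assumptions made for illustration; only {DBI}, {duckdb}, {dplyr} and {dbplyr} calls are used.

```r
library(dplyr)

# In-memory DuckDB connection; nothing is written to disk
con <- DBI::dbConnect(duckdb::duckdb())

# Hypothetical long-format table: one row per (gene, cell) expression value
expr <- data.frame(
  gene_id    = c("Gna12", "Ccnd2", "Gna12"),
  cell_id    = c("cell_1", "cell_1", "cell_2"),
  expression = c(2, 5, 1)
)
DBI::dbWriteTable(con, "dbMatrix", expr)

# Build the query lazily: no data is processed at this point
query <- dplyr::tbl(con, "dbMatrix") |>
  dplyr::filter(cell_id == "cell_1") |>
  dplyr::transmute(gene_id, cell_id, scaled = expression * 10) |>
  dplyr::arrange(gene_id)

# Print the SQL that {dbplyr} generates for DuckDB
dplyr::show_query(query)

# Only collect() runs the query in DuckDB and pulls the result into R
scaled <- dplyr::collect(query)

DBI::dbDisconnect(con)
```

Until `collect()` is called, `query` is only a description of the computation, which is what lets the database plan and run it without loading the full table into R.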
Illustrative {dbMatrix} benchmark
norm_mat <- t(t(dbMatrix) / libsizes) * scalefactor
lib_norm_mat <- log(norm_mat + offset) / log(base)
log_norm_mat <- t(norm_mat) - colMeans(lib_norm_mat)
# ...additional matrix operations
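The double transpose in the first line of the benchmark divides each column (cell) by its own library size. A small in-memory illustration with a base R matrix, using made-up counts and assumed values for `scalefactor`, `offset` and `base`, shows the arithmetic; it does not exercise {dbMatrix} itself.

```r
# Toy genes x cells count matrix (hypothetical values)
counts <- matrix(
  c(1,  3,
    4,  2,
    5, 15),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("gene_a", "gene_b", "gene_c"), c("cell_1", "cell_2"))
)

libsizes    <- colSums(counts)  # total counts per cell: 10 and 20
scalefactor <- 100              # assumed value for illustration
offset      <- 1                # assumed value for illustration
base        <- 2                # assumed value for illustration

# t(counts) is cells x genes, so dividing by libsizes scales row i (cell i)
# by libsizes[i]; transposing back gives genes x cells again
norm_mat <- t(t(counts) / libsizes) * scalefactor

# Log transform with an offset so that zero counts stay finite
log_mat <- log(norm_mat + offset) / log(base)
```

After normalization every column of `norm_mat` sums to `scalefactor`, which is a quick sanity check that the division was applied per cell rather than per gene.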
Illustrative {dbSpatial} benchmark
Task: find the intersection between cell polygons and tissue regions of interest (ROIs)
Median Runtime (seconds; 5X queries)
| ROI 1 | 1564 | 0.05 | 0.480 | 9X |
| ROI 2 | 92498 | 1.56 | 39.024 | 25X |
| ROI 3 | 143245 | 2.55 | 65.398 | 25X |
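For readers who want to see what a polygon/ROI intersection query looks like in DuckDB, here is a minimal sketch that calls the DuckDB spatial extension directly through {DBI}. It is not the {dbSpatial} API: the table name `cells`, the toy polygons and the ROI are all made up, and `INSTALL spatial` needs network access the first time it runs.

```r
con <- DBI::dbConnect(duckdb::duckdb())

# Download (first run only) and load the DuckDB spatial extension
DBI::dbExecute(con, "INSTALL spatial;")
DBI::dbExecute(con, "LOAD spatial;")

# Two hypothetical cell polygons: a 2x2 square at the origin and a unit square far away
DBI::dbExecute(con, "
  CREATE TABLE cells AS
  SELECT * FROM (VALUES
    (1, ST_GeomFromText('POLYGON((0 0, 2 0, 2 2, 0 2, 0 0))')),
    (2, ST_GeomFromText('POLYGON((5 5, 6 5, 6 6, 5 6, 5 5))'))
  ) AS t(cell_id, geom);
")

# Hypothetical ROI: a 3x3 square overlapping only the first cell
roi_wkt <- "POLYGON((1 1, 4 1, 4 4, 1 4, 1 1))"

# Keep cells that intersect the ROI and report the overlapping area
sql <- sprintf("
  SELECT cell_id,
         ST_Area(ST_Intersection(geom, ST_GeomFromText('%s'))) AS overlap_area
  FROM cells
  WHERE ST_Intersects(geom, ST_GeomFromText('%s'))
  ORDER BY cell_id
", roi_wkt, roi_wkt)
hits <- DBI::dbGetQuery(con, sql)

DBI::dbDisconnect(con)
```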
Illustrative {dbSequence} benchmark
Task: filter reads in a genomic *.bam file (28GB, 285e6 reads)
samtools v1.20 (Li et al. 2009)
- q01: chromosome region
- q02: q01 + flag
- q03: q02 + CIGAR string (samtools + awk)
Median Runtime (seconds; 5X queries)
| q01 | 0.03400 | 0.08 | 2X |
| q02 | 0.00622 | 0.02 | 3X |
| q03 | 19.26000 | DNC | ∞ |
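The three queries are cumulative filters, which map naturally onto SQL predicates. The sketch below runs them against a toy table of alignments in DuckDB. Everything specific here is an assumption for illustration: the table layout (SAM-style `qname`, `flag`, `rname`, `pos`, `cigar` columns), the region, the flag bit and the CIGAR pattern. It does not show how {dbSequence} reads or stores *.bam files.

```r
con <- DBI::dbConnect(duckdb::duckdb())

# Hypothetical alignments using SAM-style column names
reads <- data.frame(
  qname = c("r1", "r2", "r3", "r4", "r5"),
  flag  = c(0L, 16L, 0L, 4L, 0L),
  rname = c("chr1", "chr1", "chr1", "chr2", "chr1"),
  pos   = c(150L, 180L, 900L, 160L, 170L),
  cigar = c("20M100N30M", "50M", "50M", "50M", "50M"),
  stringsAsFactors = FALSE
)
DBI::dbWriteTable(con, "reads", reads)

# q01: chromosome region (assumed region chr1:100-200)
q01 <- "rname = 'chr1' AND pos BETWEEN 100 AND 200"
# q02: q01 + flag (assumed criterion: reverse-strand bit 16 not set)
q02 <- paste(q01, "AND (flag & 16) = 0")
# q03: q02 + CIGAR string (assumed criterion: contains a skipped region 'N')
q03 <- paste(q02, "AND cigar LIKE '%N%'")

# Helper that returns the read names matching a WHERE clause
run_filter <- function(where) {
  sql <- paste("SELECT qname FROM reads WHERE", where, "ORDER BY qname")
  DBI::dbGetQuery(con, sql)$qname
}

res_q01 <- run_filter(q01)
res_q02 <- run_filter(q02)
res_q03 <- run_filter(q03)

DBI::dbDisconnect(con)
```

Because each query only adds a predicate to the previous one, the CIGAR condition in q03 stays inside the same SQL statement instead of requiring a second tool in the pipeline.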
Conclusions
Advantages of using DuckDB for scientific data analysis
- Runs on modern laptops: All previous benchmarks were performed on a MacBook Pro M2, 16GB RAM, 512GB SSD
- Open Source: MIT license
- Platform-independent: Runs on all major OS
- Portable: Share results in a single *.db file
- Affordable: Free to use, pay for more local storage as needed or ‘hybrid execution’ with MotherDuck
- First release: 08/15/2024 (today 🎉)
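As a small illustration of the portability point, the sketch below writes a result table to a single DuckDB file and reopens it read-only, as a collaborator might after receiving the file. The table name `results` and its contents are made up for the example.

```r
# Temporary path standing in for a file you would share
db_path <- tempfile(fileext = ".db")

# Write a hypothetical result table into a single-file DuckDB database
con <- DBI::dbConnect(duckdb::duckdb(dbdir = db_path))
DBI::dbWriteTable(
  con, "results",
  data.frame(gene = c("Sox9", "Sez6"), score = c(1.5, 2.5))
)
DBI::dbDisconnect(con, shutdown = TRUE)

# Reopen the same file read-only and load the table back into R
con2 <- DBI::dbConnect(duckdb::duckdb(dbdir = db_path, read_only = TRUE))
shared <- DBI::dbReadTable(con2, "results")
DBI::dbDisconnect(con2, shutdown = TRUE)
```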
Limitations and future directions
- dbverse is currently only compatible with R
  - Plan to support other languages (e.g. Python)
- Limited visualization/plotting functionality
  - uwdata/mosaic integration, see discussion #354
- Limited support for large images
  - DuckDB Spatial Extension Raster support
- Plans to adopt duckplyr
  - See duckdblabs/duckplyr issue #86
- … and much more!
Acknowledgements
Ruben Dries Lab
- Jiaji George Chen
- Iqra Amin
- Wonyl Choi
- Junxiang Xu
- Yibing Michelle Wei
- Jeffrey Sheridan
- Quynh Sun
- Veronica Jarzabek
Funding