DuckDB for R Users

0.1 DuckDB for R Users

DuckDB is an OLAP database used by data professionals, such as data scientists and analysts, to analyze data quickly and efficiently. It leverages a SQL query execution engine capable of running complex queries on large datasets. Essentially, it’s like having a quacking-fast analytics-focused database right on your local computer!


DuckDB is an in-process analytical database system that integrates seamlessly with R. Here’s a concise guide on how to interact with DuckDB using R:

0.1.1 Installation and Loading:

Install the duckdb package from CRAN, then load it in your R session:

```r
install.packages("duckdb")
library(duckdb)
```

0.1.2 Connecting to DuckDB:

Create a connection object to DuckDB. You have two options, an in-memory connection or a file-based connection (against a new or existing DuckDB file):

```r
# In-memory connection: fast, but the contents disappear on disconnect
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:", read_only = FALSE)

# File-based connection: persists the database in a single file on disk
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "path/to/your/duckdb_file.db", read_only = FALSE)
```

Importing Data:

Once connected, you can import data into DuckDB from standard R data frames or other data sources. For example, if you have a data frame my_data, you can insert it into DuckDB:

dbWriteTable(con, "my_table", my_data)
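By default dbWriteTable() fails if the table already exists; DBI’s overwrite and append arguments control that behavior. And if you only want to query a data frame without copying it into the database, the duckdb package offers duckdb_register() (the view name below is illustrative):

```r
# Replace the table contents if "my_table" already exists
dbWriteTable(con, "my_table", my_data, overwrite = TRUE)

# Or expose the data frame as a virtual table without copying the data
duckdb::duckdb_register(con, "my_data_view", my_data)
```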

0.1.3 Querying and Analyzing Data:

Execute SQL queries directly against DuckDB with the dbGetQuery() function, which returns the result as a data frame:

```r
result <- dbGetQuery(con, "SELECT * FROM my_table WHERE column_name > 100")
```
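dbGetQuery() also supports parameterized queries through the params argument, which avoids pasting values into the SQL string by hand; a small sketch reusing the hypothetical table above:

```r
# The ? placeholder is bound from params when the query runs
result <- dbGetQuery(
  con,
  "SELECT * FROM my_table WHERE column_name > ?",
  params = list(100)
)
```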

0.1.4 Closing the Connection:

Always close the connection when you’re done:

```r
dbDisconnect(con)
```
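With the duckdb backend, dbDisconnect() also accepts a shutdown argument to stop the embedded database instance along with the connection, which is worth doing for file-based databases:

```r
# Close the connection and shut down the embedded DuckDB instance
dbDisconnect(con, shutdown = TRUE)
```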

## Updating Tables in DuckDB
When working with the R programming language and DuckDB, you can efficiently update tables using the following approaches:
Using SQL Queries:

DuckDB supports SQL, allowing you to perform various operations on tables. For example, you can use DBI::dbExecute() to run SQL statements that modify data in the database. To update a table, write an appropriate SQL UPDATE statement and execute it with dbExecute(). For instance, to update a moving-average column in a prices table:

```r
# Example: updating a 60-day moving average in DuckDB
query <- "UPDATE prices
          SET moving_avg_60 = (SELECT AVG(price)
                               FROM prices p2
                               WHERE p2.date BETWEEN prices.date - 59 AND prices.date)"
DBI::dbExecute(con, query)
```

Using Temporary Tables and Bulk Operations:

For efficient bulk updates, consider using temporary tables. First, create an empty temporary table matching the structure of your main table. Then load data into the temporary table and perform an INSERT INTO operation to update the main table. Here’s an example:

```r
# Create an empty temporary table matching the main table's structure
ct <- "CREATE OR REPLACE TEMP TABLE stg AS SELECT * FROM main WHERE 1 = 2"
DBI::dbExecute(con, ct)

# Load data into the temporary table (stg)
dbAppendTable(con, "stg", data_to_update)

# Insert the staged rows into the main table, skipping duplicates
# (ON CONFLICT DO NOTHING requires a primary key or unique constraint on main)
query <- "INSERT INTO main SELECT * FROM stg ON CONFLICT DO NOTHING"
DBI::dbExecute(con, query)
```

Remember that DuckDB provides advantages like efficient handling of large data, SQL support, and lightweight performance, making it a great choice for managing data in R.
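If you prefer dplyr verbs over hand-written SQL, the same tables can be queried lazily through the dbplyr backend; a minimal sketch, assuming the prices table from the example above and that the dplyr and dbplyr packages are installed:

```r
library(dplyr)

# tbl() creates a lazy reference; dbplyr translates the pipeline to SQL,
# and collect() runs it in DuckDB and returns an R data frame
avg_price <- tbl(con, "prices") |>
  summarise(avg_price = mean(price, na.rm = TRUE)) |>
  collect()
```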

## Using DuckDB in a Shiny App

To use a DuckDB local database inside a Shiny app, follow these steps:

Install Required Packages: Make sure you have the necessary R packages installed. You’ll need arrow, dplyr, lubridate, duckdb, and stringr. If any of these are missing, run install.packages("package_name") in the R console.

Prepare Your Data: Download the monthly Yellow Taxi trip data (2019-2023) in Parquet format. You’ll end up with 60 individual data files, around 3.3 GB in total. Your Shiny app will load data one year (12 Parquet files) at a time.

Read Data Using dplyr: Write a function that reads one year’s worth of Parquet files from disk and performs aggregations (e.g., basic statistics like the number of rides, average duration, etc.). You’ll need to dynamically load data from the scattered Parquet files.

Read Data Using DuckDB: Create a function similar to the one above, but have DuckDB do the work. DuckDB lets you write the aggregations using dplyr-style functions or SQL, and you can use a glob pattern to specify the file locations instead of building a list of paths in a loop (see the sketch after this list).

Deploy Your Shiny App: When deploying your Shiny app, ensure that the duckdb package is installed. Deployment options (e.g., AWS EC2, EKS, ECS) may affect how you set up DuckDB on your server.
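As a rough illustration of the DuckDB approach, here is a minimal sketch that aggregates a full year of trips through a glob pattern; the file layout and column names (e.g., trip_distance) are assumptions based on the public Yellow Taxi schema:

```r
library(duckdb)

con <- DBI::dbConnect(duckdb::duckdb())

# read_parquet() accepts a glob, so all 12 monthly files for 2019
# are scanned in a single query, with no loop over file paths
stats_2019 <- DBI::dbGetQuery(con, "
  SELECT COUNT(*)           AS n_rides,
         AVG(trip_distance) AS avg_distance
  FROM read_parquet('data/yellow_tripdata_2019-*.parquet')
")

DBI::dbDisconnect(con, shutdown = TRUE)
```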

Remember to organize your data files and handle connections appropriately within your Shiny app. Happy coding!


## DuckDB vs SQLite

DuckDB and SQLite are both popular embedded databases, but they have different strengths and suit different use cases. Here’s a comparison:

0.1.5 Design Focus

  • DuckDB: Optimized for analytical processing and complex SQL queries. It is often referred to as the “SQLite for analytics”¹.
  • SQLite: Designed for simplicity and ease of use, with a focus on being a lightweight, zero-configuration database¹.

0.1.6 Performance

  • DuckDB: Excels in handling complex queries and large datasets, leveraging columnar storage and parallel processing to improve performance¹².
  • SQLite: Performs well for simple, transactional workloads and smaller datasets. It is not optimized for heavy analytical queries².

0.1.7 Concurrency

  • DuckDB: Supports concurrent reads but only one write operation at a time¹.
  • SQLite: Similar to DuckDB, it allows multiple read operations but only one write operation at a time¹.

0.1.8 Storage

  • DuckDB: Uses columnar storage, which is beneficial for analytical workloads as it allows for better compression and faster read performance¹.
  • SQLite: Uses row-based storage, which is more suited for transactional workloads¹.

0.1.9 Integration

  • DuckDB: Integrates well with data science tools and environments like Python, R, and Jupyter notebooks¹.
  • SQLite: Widely used in various applications, from mobile apps to web browsers, due to its simplicity and ease of integration¹.

0.1.10 Use Cases

  • DuckDB: Ideal for data analysis, complex queries, and scenarios requiring high performance on large datasets¹.
  • SQLite: Best for applications needing a lightweight, embedded database with simple transactional needs¹.

0.1.11 Scalability

  • Both DuckDB and SQLite are embedded databases and are not designed for high scalability or distributed environments¹.

In summary, if your project involves complex analytical queries and large datasets, DuckDB might be the better choice. For simpler, transactional workloads and ease of use, SQLite is a great option.


Sources:

  ¹ Best Embedded Databases: A DuckDB vs SQLite Comparison (RedSwitches Blog). https://www.redswitches.com/blog/duckdb-vs-sqlite/
  ² Benchmarking DuckDB vs SQLite for Simple Queries (Lukas Barth). https://www.lukas-barth.net/blog/sqlite-duckdb-benchmark/
  ³ Researchers, please replace SQLite with DuckDB now (Dirk Petersen, Medium). https://dirk-petersen.medium.com/researchers-please-replace-sqlite-with-duckdb-now-f038044a2702
  ⁴ DuckDB vs. SQLite Comparison (SourceForge). https://sourceforge.net/software/compare/DuckDB-vs-SQLite/

## Advantages of DuckDB

DuckDB offers several advantages, making it a popular choice for certain use cases:
1. In-Memory Processing: DuckDB can run entirely in memory and uses a vectorized execution engine, which can lead to faster query execution times than traditional disk-based databases.
2. SQL Compatibility: It supports a wide range of SQL features, making it easy to use for those familiar with SQL.
3. Lightweight and Embedded: DuckDB is lightweight and can be embedded directly into applications, eliminating the need for a separate database server.
4. Columnar Storage: It uses columnar storage, which is beneficial for analytical queries as it allows for better compression and faster read performance.
5. Single-File Database: DuckDB stores the entire database in a single file, simplifying database management and portability.
6. Concurrency: While it supports concurrent reads, it also ensures data integrity by allowing only one write operation at a time.
7. Integration: DuckDB integrates well with various data science tools and environments, such as Python, R, and Jupyter notebooks.
8. Open Source: Being open source, DuckDB is free to use and has a growing community contributing to its development and improvement.
These features make DuckDB particularly well-suited for analytical workloads, data science projects, and scenarios where embedding a database within an application is desirable.

## Concurrency and Multiple Users

DuckDB supports multiple users accessing the same database simultaneously, but with some limitations. Multiple processes can open and read from the same database at the same time, but only one process can write to the database at any given moment¹². This means that while concurrent reads are possible, concurrent writes are not.
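One way to take advantage of the multi-reader model is to have every process open the database in read-only mode; a minimal sketch (the file name is illustrative):

```r
library(duckdb)

# Several processes may open the same database file simultaneously,
# provided every connection is read-only
con <- DBI::dbConnect(
  duckdb::duckdb(),
  dbdir = "analytics.db",
  read_only = TRUE
)
```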

If your application requires high levels of concurrency with multiple users writing to the database simultaneously, you might want to consider using a client/server database like PostgreSQL or MySQL¹.


Sources:

  ¹ Multiple running database instances on same database file (duckdb/duckdb#77, GitHub). https://github.com/duckdb/duckdb/issues/77
  ² Concurrency (DuckDB documentation). https://duckdb.org/docs/connect/concurrency.html
  ³ How to connect from multiple processes to the same db file (duckdb/duckdb#5946, GitHub). https://github.com/duckdb/duckdb/discussions/5946
  ⁴ Access database file on different computers at same time (Stack Overflow). https://stackoverflow.com/questions/45273134/access-database-file-on-different-computers-at-same-time
  ⁵ SQLite FAQ. https://www.sqlite.org/faq.html