This is a step-by-step guide on how to merge large NSQIP files (for example, the main PUF with files specific to a particular surgery). In this tutorial, we will learn how to merge two years of NSQIP PUF data files with the Hysterectomy (GYN-specific) targeted files.
Note: to save time and avoid rerunning all the code, every code chunk is currently set to the eval=FALSE option.
library(tidyverse)
library(duckdb)
library(haven)   # read SAS files
library(arsenal) # create summary tables (tableby)
library(dplyr)   # already attached with tidyverse; loaded again here for clarity
library(knitr)   # report formatting (kable)
In the step below, I connect to a DuckDB database that already contains a couple of yearly PUF files, so I do not need to load them into DuckDB myself; they were provided to me by my able colleague Henry, and a big shout-out to him for pointing me to the wonders of DuckDB. (In Step 3, you will also see how to load files into DuckDB.)
We can reconnect to a .duckdb database in future sessions as shown below; there is no need to re-import the data. For now we have "acs_nsqip_puf22" and "acs_nsqip_puf23"; later we will have additional tables such as "puf_tar_hyst_2022", "puf_tar_hyst_2023", and "merged_results".
✅ When we provide a dbdir, DuckDB stores the database as a file on disk, in our current working directory unless we specify a full path. In other words, the con object uses on-disk (persistent) mode: the tables (puf_tar_hyst_2022, etc.) are stored inside the file nsqip.duckdb, which lives in our current working directory.
❌ If we had used dbdir = ":memory:" (or left it empty), the database would be in-memory (RAM-only). If, say, we connected that way using an object named con2, we would be using in-memory mode: all data (including the SAS files written via dbWriteTable) is stored in RAM inside the DuckDB engine. Once you close the R session or disconnect, all tables and data are lost. This is fast and convenient but not persistent.
# Load libraries
library(DBI)
library(duckdb)
# Connect to DuckDB file
con <- dbConnect(duckdb::duckdb(), dbdir = "nsqip.duckdb") # This creates or connects to a DuckDB database file named nsqip.duckdb
# Check connection status
con # Should print something like <DuckDBConnection>
# List available tables
dbListTables(con)
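To make the contrast with the in-memory mode described above concrete, here is a minimal sketch (con2 is just an illustrative name and is not used anywhere else in this tutorial):
# In-memory (RAM-only) connection, in contrast to the on-disk connection above
con2 <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")
dbListTables(con2)                  # empty: a brand-new in-memory database
dbDisconnect(con2, shutdown = TRUE) # everything in con2 is gone after this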
The two yearly PUF files are stored as DuckDB tables; I need to merge them by stacking them on top of each other. This can be achieved using the code below.
# Read both tables
df_22 <- dbGetQuery(con, "SELECT * FROM acs_nsqip_puf22")
df_23 <- dbGetQuery(con, "SELECT * FROM acs_nsqip_puf23")
# Merge datasets while keeping additional columns
df <- bind_rows(df_22, df_23) # Automatically aligns columns, missing ones will be NA
# Check merged dataframe
str(df) # Structure of final dataset
head(df) # Preview of merged data
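If you prefer to keep the stacking inside DuckDB instead of pulling both tables into R, newer DuckDB releases support UNION ALL BY NAME, which aligns columns by name and fills missing ones with NULL, much like bind_rows(). A minimal sketch, assuming your DuckDB version supports the BY NAME syntax (the table name puf_stacked is just an illustrative choice):
# Stack the two yearly PUF tables inside DuckDB, aligning columns by name
dbExecute(con, "
  CREATE OR REPLACE TABLE puf_stacked AS
  SELECT * FROM acs_nsqip_puf22
  UNION ALL BY NAME
  SELECT * FROM acs_nsqip_puf23
")
dbGetQuery(con, "SELECT COUNT(*) AS n_rows FROM puf_stacked")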
Now, I will subset the merged data to contain only Hysterectomy cases and call it df_subset.
cpt_codes <- c("58570", "58571", "58572", "58573", "58550", "58552", "58553", "58554", "58541", "58542", "58543", "58544")
# Keep rows where the primary CPT, any other CPT, or any concurrent CPT is a hysterectomy code
df_subset <- df[df$CPT %in% cpt_codes |
                  apply(df[, paste0("OTHERCPT", 1:10)], 1, function(x) any(x %in% cpt_codes)) |
                  apply(df[, paste0("CONCPT", 1:4)], 1, function(x) any(x %in% cpt_codes)), ]
dim(df_subset)
object.size(df_subset)
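The same subset can also be written with dplyr using if_any(), which some readers may find easier to follow; this is a sketch assuming the OTHERCPT1-10 and CONCPT1-4 columns exist as used above:
# dplyr equivalent of the base-R subset above
df_subset <- df %>%
  filter(
    CPT %in% cpt_codes |
      if_any(num_range("OTHERCPT", 1:10), ~ .x %in% cpt_codes) |
      if_any(num_range("CONCPT", 1:4), ~ .x %in% cpt_codes)
  )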
In this step we convert the large Hysterectomy-specific SAS files into DuckDB tables and, once they are written, remove them from R's memory. This adds those files to the DuckDB database.
For now I am not merging another set of files (the GYN-specific files), since this is a tutorial for illustrative purposes; for the actual analysis we will need to add them as well.
# Import large SAS files into DuckDB
puf_2022_hyst <- read_sas("puf_tar_hyst_2022.sas7bdat")
puf_2023_hyst <- read_sas("puf_tar_hyst_2023.sas7bdat")
# Write them to DuckDB
dbWriteTable(con, "puf_tar_hyst_2022", puf_2022_hyst, overwrite = TRUE)
dbWriteTable(con, "puf_tar_hyst_2023", puf_2023_hyst, overwrite = TRUE)
# Remove from R memory
rm(puf_2022_hyst, puf_2023_hyst)
gc() # Free memory
The code below checks the column names to make sure the PUF files and their corresponding year's Hysterectomy files share common column names, and also checks the structure of the files by displaying the first 5 rows.
dbListTables(con) #list all the tables in the database
#check the column names of the table
cols_acs_nsqip_puf22 <- colnames(dbReadTable(con, "acs_nsqip_puf22"))
cols_puf_tar_hyst_2022 <- colnames(dbReadTable(con, "puf_tar_hyst_2022"))
common_columns22 <- intersect(cols_acs_nsqip_puf22, cols_puf_tar_hyst_2022)
common_columns22
cols_acs_nsqip_puf23 <- colnames(dbReadTable(con, "acs_nsqip_puf23"))
cols_puf_tar_hyst_2023 <- colnames(dbReadTable(con, "puf_tar_hyst_2023"))
common_columns23 <- intersect(cols_acs_nsqip_puf23, cols_puf_tar_hyst_2023)
common_columns23
# check structure
dbGetQuery(con, "SELECT * FROM puf_tar_hyst_2022 LIMIT 5")
dbGetQuery(con, "SELECT * FROM puf_tar_hyst_2023 LIMIT 5")
Rename the CASEID column in the puf_tar_hyst_2022 and puf_tar_hyst_2023 tables to CaseID. The ALTER TABLE commands below do this so that the tables can be joined with acs_nsqip_puf22 and acs_nsqip_puf23.
Then perform the join on CaseID: a LEFT JOIN links the main tables (acs_nsqip_puf22 and acs_nsqip_puf23) with the corresponding puf_tar_hyst_2022 and puf_tar_hyst_2023 tables using the CaseID column.
Note: we can change the LEFT JOIN to an INNER JOIN if we only want to keep rows that have a matching CaseID in both tables.
# Rename CASEID to CaseID in both puf_tar_hyst_2022 and puf_tar_hyst_2023
dbExecute(con, "ALTER TABLE puf_tar_hyst_2022 RENAME COLUMN CASEID TO CaseID")
dbExecute(con, "ALTER TABLE puf_tar_hyst_2023 RENAME COLUMN CASEID TO CaseID")
# Now perform the joins
query_22 <- "
SELECT *
FROM acs_nsqip_puf22 AS acs
LEFT JOIN puf_tar_hyst_2022 AS puf
ON acs.CaseID = puf.CaseID
"
query_23 <- "
SELECT *
FROM acs_nsqip_puf23 AS acs
LEFT JOIN puf_tar_hyst_2023 AS puf
ON acs.CaseID = puf.CaseID
"
# Execute the queries
result_22 <- dbGetQuery(con, query_22)
result_23 <- dbGetQuery(con, query_23)
colnames(result_22)
colnames(result_23)
# Check if there is more than one "CaseID" column in either data frame
if (sum(colnames(result_22) == "CaseID") > 1) {
  # Rename the second "CaseID" in result_22 to avoid conflict
  colnames(result_22)[which(colnames(result_22) == "CaseID")[2]] <- "CaseID_22"
}
if (sum(colnames(result_23) == "CaseID") > 1) {
  # Rename the second "CaseID" in result_23 to avoid conflict
  colnames(result_23)[which(colnames(result_23) == "CaseID")[2]] <- "CaseID_23"
}
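Alternatively, the duplicate column can be avoided at query time: DuckDB's SELECT supports an EXCLUDE clause, so the join can drop the second CaseID before the result ever reaches R. A sketch for 2022, assuming your DuckDB version supports EXCLUDE (query_22_no_dup is just an illustrative name):
# Drop the duplicate CaseID from the targeted table inside the join itself
query_22_no_dup <- "
SELECT acs.*, puf.* EXCLUDE (CaseID)
FROM acs_nsqip_puf22 AS acs
LEFT JOIN puf_tar_hyst_2022 AS puf
  ON acs.CaseID = puf.CaseID
"
result_22 <- dbGetQuery(con, query_22_no_dup)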
Write result_22 and result_23 back to DuckDB as temporary tables.
Use SQL to merge them into a new table in DuckDB.
Remove the temporary tables after the merge.
Unload the data from R’s memory
# Check the column names of both data frames
colnames(result_22)
colnames(result_23)
# If result_22 has more columns, add missing columns to result_23
missing_cols <- setdiff(colnames(result_22), colnames(result_23))
for (col in missing_cols) {
  result_23[[col]] <- NA # Add missing columns with NA values
}
# If result_23 has more columns, add missing columns to result_22
missing_cols <- setdiff(colnames(result_23), colnames(result_22))
for (col in missing_cols) {
  result_22[[col]] <- NA # Add missing columns with NA values
}
# Ensure column order is the same in both data frames
result_23 <- result_23[, colnames(result_22)]
# Check if the columns now match
colnames(result_22)
colnames(result_23)
# Write the cleaned data frames to DuckDB as temporary tables
dbWriteTable(con, "temp_result_22", result_22, overwrite = TRUE)
dbWriteTable(con, "temp_result_23", result_23, overwrite = TRUE)
# Merge (row-bind) them using SQL with UNION ALL
dbExecute(con, "
CREATE TABLE merged_results AS
SELECT * FROM temp_result_22
UNION ALL
SELECT * FROM temp_result_23
")
# Optionally, check the first few rows of the merged table to verify
dbGetQuery(con, "SELECT * FROM merged_results LIMIT 5")
# Clean up by dropping temporary tables
dbExecute(con, "DROP TABLE temp_result_22")
dbExecute(con, "DROP TABLE temp_result_23")
# Remove the data from R's memory
rm(result_22, result_23)
gc() # Free memory
Note that this creates the table, but it holds a very large amount of data, so a better approach before analysis is to trim the fat by applying inclusion/exclusion criteria and pulling only those variables/columns (out of the 320) that are needed for analysis. For that I have created a separate file and will share it later.
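As a small illustration of that idea, the column list and filter can be specified directly in SQL so that only the needed slice of merged_results ever enters R. The columns and WHERE clause below are placeholders, not the actual inclusion/exclusion criteria:
# Pull only a handful of columns, filtered inside DuckDB
slim_df <- dbGetQuery(con, "
  SELECT CaseID, SEX, Age, RACE_NEW, SURGSPEC, SMOKE, DIABETES, HYST_BLADFISTULA
  FROM merged_results
  WHERE SURGSPEC = 'Gynecology'
")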
# Load the merged_results table from DuckDB into R as a data frame
merged_results_df <- dbGetQuery(con, "SELECT * FROM merged_results")
#merged_results_df$HYST_BLADFISTULA
# Now run tableby using the loaded data frame
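# Note: 'mycontrols' is not defined earlier in this tutorial; the settings below
# are only an illustrative assumption using arsenal::tableby.control()
mycontrols <- tableby.control(test = TRUE, total = TRUE)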
tab2 <- tableby(ROBOT_USED ~ SEX + Age + RACE_NEW + SURGSPEC + SMOKE + DIABETES + HYST_BLADFISTULA,
                data = merged_results_df, control = mycontrols)
# View the result
summary(tab2)
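In a knitted R Markdown document, summary(tab2) renders as a formatted table; at the console you can request a plain-text version instead:
summary(tab2, text = TRUE) # plain-text output at the console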