tidynpi Demo: NPPES Data Lookup in R

Efficient Provider Matching via Parquet Lake

Author

LFMG

Published

December 18, 2025

What This Demo Covers

This document showcases the tidynpi package, which:

Reads from a manifest of partitioned NPPES parquet files
Matches provider names efficiently using Jaro-Winkler similarity
Performs selective reads through DuckDB’s filter pushdown
Accesses the hosted data lake over public HTTPS (no credentials required)
Supports a reproducible workflow with version-pinned snapshots

Workflow: Load inputs -> Connect to parquet lake -> Run enrichment -> Analyze results
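
To make the name-matching idea concrete, here is a minimal base R sketch of textbook Jaro-Winkler similarity. It is not taken from tidynpi, whose internal matching code is not shown in this document; the function names `jaro()` and `jaro_winkler()` and the defaults `p = 0.1` and `max_prefix = 4` are assumptions chosen for illustration.

```r
# Illustrative sketch only: textbook Jaro-Winkler, not tidynpi's implementation.

# Jaro similarity between two single strings.
jaro <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  if (n == 0 && m == 0) return(1)
  if (n == 0 || m == 0) return(0)

  # Characters match only if they sit within this distance of each other.
  window <- max(0, floor(max(n, m) / 2) - 1)
  x_hit <- logical(n)
  y_hit <- logical(m)

  for (i in seq_len(n)) {
    lo <- max(1, i - window)
    hi <- min(m, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!y_hit[j] && x[i] == y[j]) {
        x_hit[i] <- TRUE
        y_hit[j] <- TRUE
        break
      }
    }
  }

  matches <- sum(x_hit)
  if (matches == 0) return(0)

  # Matched characters that appear in a different order, counted in pairs.
  transpositions <- sum(x[x_hit] != y[y_hit]) / 2
  (matches / n + matches / m + (matches - transpositions) / matches) / 3
}

# Winkler adjustment: reward a shared prefix of up to max_prefix characters.
jaro_winkler <- function(a, b, p = 0.1, max_prefix = 4) {
  sim <- jaro(a, b)
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  prefix <- 0
  for (i in seq_len(min(length(x), length(y), max_prefix))) {
    if (x[i] != y[i]) break
    prefix <- prefix + 1
  }
  sim + prefix * p * (1 - sim)
}

# Compare an input name against a slightly misspelled candidate.
print(jaro_winkler("JOHN BURKE", "JON BURKE"))
```

Scores fall between 0 (no matching characters) and 1 (identical strings), and the prefix bonus favors names that agree in their first few characters, which suits typos that occur later in a name.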

1 Setup and Configuration

1.1 Package Loading

Load packages and configure environment

options(width = 120)

# ==============================================================================
# INSTALLATION & LOADING OPTIONS
# ==============================================================================

# Option 1: Install from GitHub (FIRST TIME ONLY - uncomment to run)
# if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
# remotes::install_github("Lfirenzeg/tidynpi")

# Option 2: Load the package previously installed from GitHub
library(tidynpi)
cat("Loaded tidynpi from GitHub installation\n")

Loaded tidynpi from GitHub installation

# Option 3: Load from a local development directory
# (kept for reference; this is how the package was loaded during development)
use_dev <- FALSE
pkg_dir <- "C:/Users/lucho/OneDrive/Documents/698/tidynpi"

if (use_dev) {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::load_all(pkg_dir)
  cat("Loaded tidynpi from development directory\n")
}

# Load supporting packages
pkgs <- c("dplyr", "readr", "jsonlite", "DBI", "ggplot2", "knitr")
to_install <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(to_install)) {
  cat("Installing missing packages:", paste(to_install, collapse = ", "), "\n")
  install.packages(to_install)
}

library(dplyr)
library(readr)
library(jsonlite)
library(DBI)
library(ggplot2)
library(knitr)

cat("All packages loaded successfully\n")

All packages loaded successfully

Development vs. Installed Package

GitHub Installation (Options 1 and 2 in the code above; recommended for most users):

1. Uncomment the GitHub installation lines (first time only)
2. Keep library(tidynpi) uncommented to load the package
3. Leave use_dev <- FALSE

Local Development (Option 3 in the code above; for package developers):

1. Set use_dev <- TRUE
2. Update pkg_dir to your local tidynpi directory
3. devtools::load_all() then loads the package from source, so code changes take effect immediately

Quick Start:

remotes::install_github("Lfirenzeg/tidynpi")  # Once
library(tidynpi)                               # Every session

2 Load Demo Data

2.1 Fetch Sample Inputs from GitHub

Download and prepare demo data

demo_url <- "https://raw.githubusercontent.com/Lfirenzeg/msds698/refs/heads/main/demo_ready.csv"

demo <- readr::read_csv(demo_url, show_col_types = FALSE)

# Verify required columns exist
stopifnot(all(c("full_name", "state_in", "city_in") %in% names(demo)))

# Prepare inputs: rename columns and filter out missing values
inputs <- demo |>
  transmute(
    full_name = full_name,
    state = state_in,
    city  = city_in
  ) |>
  filter(!is.na(full_name) & full_name != "", !is.na(state) & state != "") |>
  distinct()

cat("Loaded", nrow(inputs), "unique input records from GitHub\n")

Loaded 720 unique input records from GitHub
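
The `stopifnot()` check above halts on a bad input file but does not say which column is absent. A small helper can make that failure easier to diagnose. The sketch below uses base R only; `check_required_cols()` is a hypothetical name introduced for illustration and is not part of tidynpi.

```r
# Hypothetical helper (not part of tidynpi): stop with a message that
# names every required column missing from the data frame.
check_required_cols <- function(df, required) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop(
      "Input is missing required column(s): ",
      paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(df)
}

# Toy input that lacks city_in; capture the error message instead of halting.
toy <- data.frame(full_name = "CRAIG COMBS", state_in = "TN")
result <- tryCatch(
  check_required_cols(toy, c("full_name", "state_in", "city_in")),
  error = function(e) conditionMessage(e)
)
print(result)
```

Because the helper returns its input invisibly when all columns are present, it could also sit at the start of a pipeline before the `transmute()` step.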

2.2 Sample Data for Quick Demo

Select random sample for demonstration

# For a quick demo, sample a smaller batch
set.seed(893)  # Reproducible sampling
n_take <- min(50L, nrow(inputs))
inputs_small <- inputs |> dplyr::slice_sample(n = n_take)

cat("Selected", n_take, "records for demo\n")

Selected 50 records for demo

# Preview the inputs
head(inputs_small, 10) |> knitr::kable(caption = "Sample Input Records")

Sample Input Records
full_name	state	city
CRAIG COMBS	TN	KNOXVILLE
JOHN BURKE	CO	STEAMBOAT SPRINGS
MICHAEL CONNOLLY	OK	OKLAHOMA CITY
HANA DANDONA	NY	ROCKVILLE CENTER
ANTONINO INSANA	NJ	WOODBURY
AMANDA WINTERS	IA	FORT MADISON
JASON TROJACEK	TX	GROESBECK
JACLYN JONES	OK	TULSA
ALFREDO HERNANDEZ	FL	MIAMI
JAIME MICHAELSON	AZ	TUBA CITY
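
The role of `set.seed()` in making the sample repeatable can be checked with a short base R sketch. It uses `sample.int()` rather than `dplyr::slice_sample()`, so the indices it draws are not claimed to match the rows selected above; `draw_rows()` is a hypothetical helper introduced only for this illustration.

```r
# Hypothetical helper: draw up to n_take row indices after fixing the seed.
# Note that set.seed() resets the session's global random number state.
draw_rows <- function(n_rows, n_take, seed) {
  set.seed(seed)
  sample.int(n_rows, size = min(n_take, n_rows))
}

first  <- draw_rows(720L, 50L, seed = 893)
second <- draw_rows(720L, 50L, seed = 893)

# Same seed, same draw: the two index vectors are identical.
print(identical(first, second))

# The min() cap keeps the request valid when fewer rows are available.
print(length(draw_rows(10L, 50L, seed = 893)))
```

Capping the sample size with `min()` mirrors the `n_take` calculation above, so the code still runs if the input file has fewer than 50 usable records.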

3 Connect to NPPES Parquet Lake