tidynpi Demo: NPPES Data Lookup in R

Efficient Provider Matching via Parquet Lake

Author

LFMG

Published

December 18, 2025

What This Demo Shows

This document showcases the tidynpi package, which:

  • Reads a manifest of partitioned NPPES parquet files
  • Matches names efficiently using Jaro-Winkler similarity
  • Performs selective reads with DuckDB’s pushdown filters
  • Accesses the hosted data lake over public HTTPS (no credentials required)
  • Supports a reproducible workflow with version-pinned snapshots

Workflow: Load inputs -> Connect to parquet lake -> Run enrichment -> Analyze results

1 Setup and Configuration

1.1 Package Loading

Load packages and configure environment
options(width = 120)

# ==============================================================================
# INSTALLATION & LOADING OPTIONS
# ==============================================================================

# Option 1: Install from GitHub (FIRST TIME ONLY - uncomment to run)
# if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
# remotes::install_github("Lfirenzeg/tidynpi")

# Option 2: Load installed package from GitHub
library(tidynpi)
cat("Loaded tidynpi from GitHub installation\n")
Loaded tidynpi from GitHub installation
# Option 3: Load from local development directory (this code is here as reference to show what was used during package development)
use_dev <- FALSE
pkg_dir <- "C:/Users/lucho/OneDrive/Documents/698/tidynpi"

if (use_dev) {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::load_all(pkg_dir)
  cat("Loaded tidynpi from development directory\n")
}

# Load supporting packages
pkgs <- c("dplyr", "readr", "jsonlite", "DBI", "ggplot2", "knitr")
to_install <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(to_install)) {
  cat("Installing missing packages:", paste(to_install, collapse = ", "), "\n")
  install.packages(to_install)
}

library(dplyr)
library(readr)
library(jsonlite)
library(DBI)
library(ggplot2)
library(knitr)

cat("All packages loaded successfully\n")
All packages loaded successfully
Development vs. Installed Package

GitHub Installation (Options 1 and 2 in the code above; recommended for most users):

1. Uncomment the remotes::install_github() lines (first time only)
2. Load the package with library(tidynpi)
3. Leave use_dev <- FALSE

Local Development (Option 3 in the code above; for package developers):

1. Set use_dev <- TRUE
2. Update pkg_dir to your local tidynpi directory
3. devtools::load_all() then loads the package from source, so code changes are picked up on reload

Quick Start (GitHub installation):

remotes::install_github("Lfirenzeg/tidynpi")  # Once
library(tidynpi)                               # Every session

2 Load Demo Data

2.1 Fetch Sample Inputs from GitHub

Download and prepare demo data
demo_url <- "https://raw.githubusercontent.com/Lfirenzeg/msds698/refs/heads/main/demo_ready.csv"

demo <- readr::read_csv(demo_url, show_col_types = FALSE)

# Verify required columns exist
stopifnot(all(c("full_name", "state_in", "city_in") %in% names(demo)))

# Prepare inputs: rename columns and filter out missing values
inputs <- demo |>
  transmute(
    full_name = full_name,
    state = state_in,
    city  = city_in
  ) |>
  filter(!is.na(full_name) & full_name != "", !is.na(state) & state != "") |>
  distinct()

cat("Loaded", nrow(inputs), "unique input records from GitHub\n")
Loaded 720 unique input records from GitHub

2.2 Sample Data for Quick Demo

Select random sample for demonstration
# For a quick demo, sample a smaller batch
set.seed(893)  # Reproducible sampling
n_take <- min(50L, nrow(inputs))
inputs_small <- inputs |> dplyr::slice_sample(n = n_take)

cat("Selected", n_take, "records for demo\n")
Selected 50 records for demo
# Preview the inputs
head(inputs_small, 10) |> knitr::kable(caption = "Sample Input Records")
Sample Input Records
full_name | state | city
CRAIG COMBS | TN | KNOXVILLE
JOHN BURKE | CO | STEAMBOAT SPRINGS
MICHAEL CONNOLLY | OK | OKLAHOMA CITY
HANA DANDONA | NY | ROCKVILLE CENTER
ANTONINO INSANA | NJ | WOODBURY
AMANDA WINTERS | IA | FORT MADISON
JASON TROJACEK | TX | GROESBECK
JACLYN JONES | OK | TULSA
ALFREDO HERNANDEZ | FL | MIAMI
JAIME MICHAELSON | AZ | TUBA CITY

3 Connect to NPPES Parquet Lake

3.1 Load Manifest and Establish DuckDB Connection

The manifest is a JSON file that lists all available parquet shards by state and last-name initial.
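
To make the sharding idea concrete, here is a minimal sketch of how a lookup by state and last-name initial could work. The nested layout (states, then state code, then initial, then URL), the toy_manifest object, the example.org URLs, and the shard_url() helper are all hypothetical; the manifest returned by tnp_manifest() may be organized differently.

```r
# Hypothetical manifest layout: states -> state code -> last-name initial -> shard URL.
toy_manifest <- list(
  states = list(
    NY = list(K = "https://example.org/nppes/NY/K.parquet",
              S = "https://example.org/nppes/NY/S.parquet"),
    TX = list(T = "https://example.org/nppes/TX/T.parquet")
  )
)

# Return the shard URL for one state and last name, or NA if no shard is listed.
shard_url <- function(man, state, last_name) {
  initial <- toupper(substr(trimws(last_name), 1, 1))
  url <- man$states[[toupper(state)]][[initial]]
  if (is.null(url)) NA_character_ else url
}

shard_url(toy_manifest, "ny", "Kim")     # resolves to the NY "K" shard
shard_url(toy_manifest, "TX", "Burke")   # no "B" shard in the toy manifest, so NA
```

Partitioning this way means a single input only needs the one shard that can contain its last name, rather than a whole state file.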

Initialize connection to parquet lake
# Load the manifest from the public Cloudflare R2 bucket
# This uses the "latest" pointer to always get the most recent snapshot
cat("Loading manifest from Cloudflare R2 (public HTTPS)...\n")
Loading manifest from Cloudflare R2 (public HTTPS)...
# Option 1: Use the latest snapshot (recommended)
man <- tnp_manifest()  # Automatically fetches latest via tnp_latest_url()

# Option 2: Pin to a specific snapshot for reproducibility (uncomment to use)
# specific_url <- "https://rnppes.org/nppes/2025-10-13/manifest_state_files.json"
# man <- jsonlite::fromJSON(specific_url, simplifyVector = FALSE)

stopifnot(is.list(man), "states" %in% names(man))

cat("  Manifest loaded successfully\n")
  Manifest loaded successfully
cat("  Available states:", length(man$states), "\n")
  Available states: 60 
cat("  Data is served from: rnppes.org\n\n")
  Data is served from: rnppes.org
# Create an in-memory DuckDB connection
con <- tnp_duckdb(":memory:")

cat("  DuckDB connection established\n")
  DuckDB connection established
Remote Manifest Loading

No local files are required. The manifest and all parquet data are served over public HTTPS from Cloudflare R2.

  • tnp_manifest(): Automatically uses the latest snapshot
  • Specific snapshot: Pin to a date for reproducible research
  • No credentials needed: Completely public access

This lets anyone run the demo without any local data setup. For results that must be exactly repeatable, pin a specific snapshot rather than relying on the latest one.
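
Because the shards are ordinary parquet files, DuckDB can apply column selection and WHERE filters while scanning them instead of downloading everything first. The sketch below only builds such a query as a string. The build_shard_query() helper and the example.org URL are hypothetical, and the column names are assumed from the result structure shown later in this document. It is not the query tidynpi itself issues.

```r
# Build a DuckDB query that scans only the listed shards and filters on state and last name.
# Assumption: the shards expose npi, first_name, last_name, state and city columns.
build_shard_query <- function(urls, state, last_name) {
  # Quote a value as a SQL string literal, doubling any embedded single quotes.
  quote_sql <- function(x) paste0("'", gsub("'", "''", x, fixed = TRUE), "'")
  url_list <- paste(quote_sql(urls), collapse = ", ")
  sprintf(
    paste(
      "SELECT npi, first_name, last_name, state, city",
      "FROM read_parquet([%s])",
      "WHERE state = %s AND last_name = %s",
      sep = "\n"
    ),
    url_list, quote_sql(toupper(state)), quote_sql(toupper(last_name))
  )
}

sql <- build_shard_query(
  urls = "https://example.org/nppes/NY/K.parquet",  # placeholder URL
  state = "ny",
  last_name = "Kim"
)
cat(sql, "\n")

# With a live DuckDB connection that can read over HTTPS, the query could be run with:
# DBI::dbGetQuery(con, sql)
```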

4 Run NPI Enrichment

4.1 Execute Matching with npi_enrich()

This is the core function that:

  1. Normalizes input names and states
  2. Searches relevant parquet shards
  3. Ranks candidates using Jaro-Winkler similarity
  4. Returns top matches per input
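
For readers unfamiliar with the ranking metric in step 3, the following is a base-R sketch of the textbook Jaro-Winkler similarity, using the conventional prefix scale of 0.1 and a maximum prefix length of 4. The jaro_winkler() function here is illustrative only; how tidynpi computes the score internally is not shown in this document.

```r
# Textbook Jaro-Winkler similarity between two strings (a sketch, not tidynpi's code).
jaro_winkler <- function(a, b, p = 0.1, max_prefix = 4L) {
  s1 <- strsplit(a, "", fixed = TRUE)[[1]]
  s2 <- strsplit(b, "", fixed = TRUE)[[1]]
  n1 <- length(s1)
  n2 <- length(s2)
  if (n1 == 0L || n2 == 0L) return(as.numeric(n1 == n2))

  # Characters count as matching only if equal and within this distance of each other.
  window <- max(0L, max(n1, n2) %/% 2L - 1L)
  m1 <- logical(n1)
  m2 <- logical(n2)
  for (i in seq_len(n1)) {
    lo <- max(1L, i - window)
    hi <- min(n2, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!m2[j] && s1[i] == s2[j]) {
        m1[i] <- TRUE
        m2[j] <- TRUE
        break
      }
    }
  }
  m <- sum(m1)
  if (m == 0L) return(0)

  # Transpositions: matched characters that appear in a different order, counted in pairs.
  transpositions <- sum(s1[m1] != s2[m2]) / 2
  jaro <- (m / n1 + m / n2 + (m - transpositions) / m) / 3

  # Winkler adjustment: reward a shared prefix of up to max_prefix characters.
  k <- min(max_prefix, n1, n2)
  same <- s1[seq_len(k)] == s2[seq_len(k)]
  prefix <- if (all(same)) k else which(!same)[1] - 1L
  jaro + prefix * p * (1 - jaro)
}

jaro_winkler("JAIME MICHAELSON", "JAIME MICHEL")  # 0.95
jaro_winkler("ANDREW PICEL", "ANDREW PRICE")      # about 0.9667
jaro_winkler("ANDREW PICEL", "ANDREW PICHLER")    # about 0.9548
```

Applied to full normalized names, this sketch gives the same values as the jaro_winkler column reports for these pairs in the results inspected later (0.95, 0.9666667, 0.9547619).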
Run NPI enrichment pipeline
cat("Running NPI enrichment...\n")
Running NPI enrichment...
cat("Strategy: strict (higher precision)\n")
Strategy: strict (higher precision)
cat("City mode: prefer (boosts city matches in ranking)\n\n")
City mode: prefer (boosts city matches in ranking)
system.time({
  res <- npi_enrich(
    inputs = inputs_small,
    con = con,
    man = man,
    
    # Matching strategy
    strategy = "strict",          # "strict" = JW >= 0.90 | "loose" = JW >= 0.85
    city_mode = "prefer",         # "ignore" | "prefer" | "require"
    
    # Candidate limits
    max_candidates = 5,           # Maximum candidates returned per input
    
    # HTTP/retry parameters
    max_urls_per_query = 5,       # Batch size for parquet shard reads
    tries = 2,                    # Retry attempts for HTTP failures
    initial_wait = 0.2,           # Initial backoff wait (seconds)
    max_wait = 1,                 # Maximum backoff wait (seconds)
    sleep_between_batches = 0,    # Sleep between URL batches (seconds)
    
    # Progress reporting
    verbose = TRUE                # Print progress messages
  )
})
   user  system elapsed 
  25.69    2.38   58.83 
cat("\nEnrichment complete\n")

Enrichment complete
Close database connection
# Always close DB connections when done
DBI::dbDisconnect(con, shutdown = TRUE)
cat("DuckDB connection closed\n")
DuckDB connection closed
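
The tries, initial_wait, and max_wait arguments describe a retry-with-backoff policy for HTTP failures. As a rough illustration of that pattern, the sketch below retries a function and doubles the wait after each failure, capped at max_wait. The with_retry() helper, the doubling rule, and the flaky() test function are assumptions made for illustration; npi_enrich() may schedule its retries differently.

```r
# Call fn() up to `tries` times, doubling the wait after each failure (capped at max_wait).
with_retry <- function(fn, tries = 2, initial_wait = 0.2, max_wait = 1) {
  wait <- initial_wait
  for (attempt in seq_len(tries)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt < tries) {
      Sys.sleep(min(wait, max_wait))
      wait <- wait * 2
    }
  }
  stop("all ", tries, " attempts failed: ", conditionMessage(result))
}

# Toy example: fails on the first call, succeeds on the second.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 2) stop("temporary failure")
  "ok"
}
with_retry(flaky, tries = 2, initial_wait = 0.01, max_wait = 0.05)
```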
Strategy Options:

  • strict: Higher precision (Jaro-Winkler ≥ 0.90), fewer false positives
  • loose: Higher recall (Jaro-Winkler ≥ 0.85), catches variant spellings

City Mode Options:

  • ignore: Match on name + state only
  • prefer: Boost ranking if city matches
  • require: Only return candidates where city matches exactly
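
The two settings can be pictured as a threshold filter followed by a city-aware ordering. The sketch below is one possible reading of the options listed here, not tidynpi's implementation: filter_candidates() is a hypothetical helper, "prefer" is modeled as a simple tie-break in favor of city matches, and the npi values are placeholders.

```r
# Hypothetical candidate filter; jaro_winkler and city_match mirror columns in the results.
filter_candidates <- function(cands,
                              strategy = c("strict", "loose"),
                              city_mode = c("ignore", "prefer", "require")) {
  strategy <- match.arg(strategy)
  city_mode <- match.arg(city_mode)
  threshold <- if (strategy == "strict") 0.90 else 0.85

  keep <- cands[cands$jaro_winkler >= threshold, , drop = FALSE]
  if (city_mode == "require") keep <- keep[keep$city_match, , drop = FALSE]

  # "prefer" puts city matches ahead of equally scored non-matches;
  # the other modes order by score only.
  ord <- if (city_mode == "prefer") {
    order(-keep$jaro_winkler, !keep$city_match)
  } else {
    order(-keep$jaro_winkler)
  }
  keep[ord, , drop = FALSE]
}

cands <- data.frame(
  npi = c(1, 2, 3, 4),                       # placeholder ids, not real NPIs
  jaro_winkler = c(1.00, 1.00, 0.92, 0.87),
  city_match = c(FALSE, TRUE, FALSE, TRUE)
)
filter_candidates(cands, "strict", "prefer")$npi   # 2 1 3
filter_candidates(cands, "loose", "require")$npi   # 2 4
```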

5 Analyze Results

5.1 Inspect Result Structure

Examine returned data structure
cat("Result dimensions:", nrow(res), "rows ×", ncol(res), "columns\n\n")
Result dimensions: 72 rows × 31 columns
dplyr::glimpse(res)
Rows: 72
Columns: 31
$ npi                <dbl> 1003042540, 1033598263, 1003009861, 1639239015, 1003012444, 1013649623, 1255577508, 1083798…
$ entity_type        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ first_name         <chr> "VERONICA", "VERONICA", "MOUSTAFA", "JAIME", "ANDREW", "ANDREW", "ANDREW", "ANDREW", "LAKSH…
$ last_name_leg      <chr> "COMBS", "COMBS", "BANNA", "MICHEL", "PICEL", "PRICE", "PRICE", "PICHLER", "SRINIVASAN", "S…
$ last_name          <chr> "COMBS", "COMBS", "BANNA", "MICHEL", "PICEL", "PRICE", "PRICE", "PICHLER", "SRINIVASAN", "S…
$ org_name           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ state              <chr> "AR", "AR", "AZ", "AZ", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CO", "CO", "CO", "CO", "…
$ city               <chr> "HARRISON", "WEST MEMPHIS", "PEORIA", "YUMA", "STANFORD", "RIVERSIDE", "OAKLAND", "CARMICHA…
$ tax_code_1         <chr> "1041C0700X", "104100000X", "207RC0000X", "2084A0401X", "2085R0202X", "152WC0802X", "101Y00…
$ tax_code_2         <chr> NA, NA, NA, "2084P0800X", "2085R0204X", "152WL0500X", NA, NA, NA, "207Q00000X", "2085R0202X…
$ tax_code_3         <chr> NA, NA, NA, NA, NA, "152WP0200X", NA, NA, NA, NA, NA, NA, NA, NA, "363A00000X", NA, NA, NA,…
$ tax_code_4         <chr> NA, NA, NA, NA, NA, "152WS0006X", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ tax_code_5         <chr> NA, NA, NA, NA, NA, "152WV0400X", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ state_part         <chr> "AR", "AR", "AZ", "AZ", "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CO", "CO", "CO", "CO", "…
$ lname_initial      <chr> "C", "C", "B", "M", "P", "P", "P", "P", "S", "S", "S", "B", "B", "C", "C", "H", "H", "H", "…
$ name_for_initial   <chr> "COMBS", "COMBS", "BANNA", "MICHEL", "PICEL", "PRICE", "PRICE", "PICHLER", "SRINIVASAN", "S…
$ lname              <chr> "COMBS", "COMBS", "BANNA", "MICHEL", "PICEL", "PRICE", "PRICE", "PICHLER", "SRINIVASAN", "S…
$ cand_first_norm    <chr> "VERONICA", "VERONICA", "MOUSTAFA", "JAIME", "ANDREW", "ANDREW", "ANDREW", "ANDREW", "LAKSH…
$ cand_last_norm     <chr> "COMBS", "COMBS", "BANNA", "MICHEL", "PICEL", "PRICE", "PRICE", "PICHLER", "SRINIVASAN", "S…
$ cand_full_norm     <chr> "VERONICA COMBS", "VERONICA COMBS", "MOUSTAFA BANNA", "JAIME MICHEL", "ANDREW PICEL", "ANDR…
$ cand_first_initial <chr> "V", "V", "M", "J", "A", "A", "A", "A", "L", "M", "E", "J", "J", "J", "J", "D", "D", "D", "…
$ cand_city_norm     <chr> "HARRISON", "WEST MEMPHIS", "PEORIA", "YUMA", "STANFORD", "RIVERSIDE", "OAKLAND", "CARMICHA…
$ input_id           <int> 47, 47, 13, 10, 27, 27, 27, 27, 28, 35, 48, 2, 2, 37, 37, 23, 23, 23, 18, 21, 21, 9, 9, 9, …
$ first_norm         <chr> "VERONICA", "VERONICA", "MOUSTAFA", "JAIME", "ANDREW", "ANDREW", "ANDREW", "ANDREW", "LAKSH…
$ last_norm          <chr> "COMBS", "COMBS", "BANNA", "MICHAELSON", "PICEL", "PICEL", "PICEL", "PICEL", "SRINIVASAN", …
$ city_norm          <chr> "HARRISON", "HARRISON", "SUN CITY WEST", "TUBA CITY", "STANFORD", "STANFORD", "STANFORD", "…
$ jaro_winkler       <dbl> 1.0000000, 1.0000000, 1.0000000, 0.9500000, 1.0000000, 0.9666667, 0.9666667, 0.9547619, 1.0…
$ exact_name         <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FAL…
$ city_match         <lgl> TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALS…
$ rank_num           <int> 1, 2, 2, 4, 1, 4, 4, 4, 1, 2, 2, 1, 2, 2, 3, 2, 4, 4, 2, 1, 4, 1, 2, 4, 1, 4, 1, 2, 1, 4, 4…
$ rank_label         <chr> "Extremely Likely Match", "Very Likely Match", "Very Likely Match", "Possible Match", "Extr…

5.2 Summary Statistics

Generate match quality metrics
if (nrow(res) == 0) {
  cat(" No matches returned.\n")
  cat("  Check your manifest path and parquet availability.\n")
} else {
  # Take the first row per input_id as the best match
  # (relies on res being ordered best-first within each input)
  best <- res |>
    group_by(input_id) |>
    slice(1) |>
    ungroup()
  
  # Summary statistics
  cat("========================================\n")
  cat("MATCH SUMMARY\n")
  cat("========================================\n")
  cat("Inputs processed:        ", nrow(inputs_small), "\n")
  cat("Inputs with matches:     ", dplyr::n_distinct(res$input_id), "\n")
  cat("Total candidate records: ", nrow(res), "\n")
  cat("Exact best matches:      ", sum(best$exact_name, na.rm = TRUE), "\n")
  cat("Perfect JW (1.0) matches:", sum(abs(best$jaro_winkler - 1) < 1e-12, na.rm = TRUE), "\n")
  cat("========================================\n\n")
  
  # Match quality distribution
  best |>
    count(rank_label, sort = TRUE, name = "Count") |>
    mutate(Percentage = sprintf("%.1f%%", Count / sum(Count) * 100)) |>
    knitr::kable(caption = "Match Quality Distribution (Best Matches)")
}
========================================
MATCH SUMMARY
========================================
Inputs processed:         50 
Inputs with matches:      45 
Total candidate records:  72 
Exact best matches:       43 
Perfect JW (1.0) matches: 43 
========================================
Match Quality Distribution (Best Matches)
rank_label | Count | Percentage
Extremely Likely Match | 26 | 57.8%
Very Likely Match | 17 | 37.8%
Possible Match | 2 | 4.4%
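
The summary code picks the first row per input_id with slice(1), which gives the best match only when the rows are already ordered best-first. A more defensive variant sorts explicitly before taking one row per input. The base-R sketch below assumes that a lower rank_num is better (rank 1 corresponds to "Extremely Likely Match" in the glimpse output); best_per_input() and the toy data are illustrative, with placeholder npi values.

```r
# Pick one best row per input_id: lowest rank_num first, then highest jaro_winkler.
best_per_input <- function(res) {
  ord <- order(res$input_id, res$rank_num, -res$jaro_winkler)
  sorted <- res[ord, , drop = FALSE]
  sorted[!duplicated(sorted$input_id), , drop = FALSE]
}

toy <- data.frame(
  input_id = c(27, 27, 47, 47),
  npi = c(111, 222, 333, 444),          # placeholder ids, not real NPIs
  rank_num = c(4, 1, 2, 1),
  jaro_winkler = c(0.97, 1.00, 1.00, 1.00)
)
best_per_input(toy)$npi   # 222 444
```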

5.3 Validate Against Ground Truth NPIs

Since the demo data includes known NPIs, we can validate the matching accuracy:

Compare matched NPIs against ground truth
if (nrow(res) > 0) {
  # Join results back to the original demo data to recover the ground-truth NPI.
  # input_row below is the row position in inputs_small, which is matched to
  # input_id in res.
  
  # The demo data should include an 'NPI' column; check before joining
  if ("NPI" %in% names(demo)) {
    # Create validation dataset
    validation <- inputs_small |>
      mutate(input_row = row_number()) |>
      left_join(
        demo |>
          transmute(
            full_name = full_name,
            state = state_in,
            city = city_in,
            true_npi = NPI
          ),
        by = c("full_name", "state", "city")
      ) |>
      select(input_row, full_name, state, city, true_npi)
    
    # Get best matches
    best_matches <- res |>
      group_by(input_id) |>
      slice(1) |>
      ungroup() |>
      select(input_id, npi, jaro_winkler, rank_label, exact_name)
    
    # Join validation with best matches
    validation_results <- validation |>
      left_join(best_matches, by = c("input_row" = "input_id")) |>
      mutate(
        match_status = case_when(
          is.na(npi) ~ "No Match Found",
          npi == true_npi ~ "Correct Match",
          TRUE ~ "Incorrect Match"
        )
      )
    
    # Calculate accuracy metrics
    cat("========================================\n")
    cat("NPI MATCHING ACCURACY\n")
    cat("========================================\n")
    cat("Total inputs:           ", nrow(validation_results), "\n")
    cat("Correct matches:        ", sum(validation_results$match_status == "Correct Match", na.rm = TRUE), "\n")
    cat("Incorrect matches:      ", sum(validation_results$match_status == "Incorrect Match", na.rm = TRUE), "\n")
    cat("No match found:         ", sum(validation_results$match_status == "No Match Found", na.rm = TRUE), "\n\n")
    
    # Calculate accuracy rate (excluding no matches)
    found_matches <- validation_results |> filter(match_status != "No Match Found")
    if (nrow(found_matches) > 0) {
      accuracy_rate <- sum(found_matches$match_status == "Correct Match") / nrow(found_matches) * 100
      cat("Accuracy rate (when match found): ", sprintf("%.1f%%", accuracy_rate), "\n")
    }
    
    # Overall accuracy (including no matches as incorrect)
    overall_accuracy <- sum(validation_results$match_status == "Correct Match") / nrow(validation_results) * 100
    cat("Overall accuracy:                 ", sprintf("%.1f%%", overall_accuracy), "\n")
    cat("========================================\n\n")
    
    # Show match status distribution
    validation_results |>
      count(match_status, sort = TRUE, name = "Count") |>
      mutate(Percentage = sprintf("%.1f%%", Count / sum(Count) * 100)) |>
      knitr::kable(caption = "Match Status Distribution")
    
  } else {
    cat("Ground truth NPI column not found in demo data\n")
  }
}
========================================
NPI MATCHING ACCURACY
========================================
Total inputs:            50 
Correct matches:         40 
Incorrect matches:       5 
No match found:          5 

Accuracy rate (when match found):  88.9% 
Overall accuracy:                  80.0% 
========================================
Match Status Distribution
match_status | Count | Percentage
Correct Match | 40 | 80.0%
Incorrect Match | 5 | 10.0%
No Match Found | 5 | 10.0%
Visualize matching accuracy
if (nrow(res) > 0 && exists("validation_results")) {
  # Plot match status
  match_counts <- validation_results |>
    count(match_status, sort = TRUE)
  
  ggplot(match_counts, aes(x = reorder(match_status, n), y = n)) +
    geom_col(aes(fill = match_status), alpha = 0.8, show.legend = FALSE) +
    geom_text(aes(label = n), hjust = -0.2, size = 4) +
    scale_fill_manual(values = c(
      "Correct Match" = "#27AE60",
      "Incorrect Match" = "#E74C3C",
      "No Match Found" = "#95A5A6"
    )) +
    coord_flip() +
    labs(
      title = "NPI Matching Validation Results",
      subtitle = paste0("Total: ", nrow(validation_results), " inputs"),
      x = NULL,
      y = "Count"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(face = "bold", size = 14),
      panel.grid.major.y = element_blank()
    )
}

Examine incorrect matches for patterns
if (nrow(res) > 0 && exists("validation_results")) {
  # Look at incorrect matches to understand why they failed
  incorrect <- validation_results |>
    filter(match_status == "Incorrect Match") |>
    select(full_name, state, city, true_npi, npi, jaro_winkler, rank_label)
  
  if (nrow(incorrect) > 0) {
    cat("Sample of Incorrect Matches:\n\n")
    incorrect |>
      head(10) |>
      knitr::kable(
        caption = "Examples of Incorrect NPI Matches",
        col.names = c("Name", "State", "City", "True NPI", "Matched NPI", "JW Score", "Rank")
      )
  } else {
    cat("No incorrect matches found!\n")
  }
}
Sample of Incorrect Matches:
Examples of Incorrect NPI Matches
Name | State | City | True NPI | Matched NPI | JW Score | Rank
MICHAEL CONNOLLY | OK | OKLAHOMA CITY | 1003014374 | 1285953091 | 0.95 | Possible Match
ALFREDO HERNANDEZ | FL | MIAMI | 1003007881 | 1295856193 | 1.00 | Extremely Likely Match
JAIME MICHAELSON | AZ | TUBA CITY | 1003013426 | 1639239015 | 0.95 | Possible Match
YU HYON KIM | NY | BRONX | 1003029026 | 1457440927 | 1.00 | Very Likely Match
KIMBERLY HARRIS | MI | PORT HURON | 1003041617 | 1588626592 | 1.00 | Very Likely Match
Understanding Validation Results

Correct Match: The top-ranked NPI from enrichment exactly matches the ground truth NPI

Incorrect Match: A match was found, but the NPI doesn’t match the ground truth

No Match Found: The enrichment pipeline didn’t return any candidates for this input

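In this run, 40 of the 45 top-ranked matches (88.9%) agree with the ground truth, which is 40 of all 50 inputs (80.0%). Three of the five incorrect matches have a Jaro-Winkler score of 1.00, which suggests that identical names shared by different providers, rather than normalization or scoring errors, account for most of the misses.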
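
The two accuracy figures use different denominators, which is easy to miss. As a worked check, the snippet below recomputes them, plus the share of inputs that received any candidate, from the counts printed in the validation output above. The variable names are illustrative.

```r
# Counts taken from the "NPI MATCHING ACCURACY" output above.
n_inputs    <- 50
n_correct   <- 40
n_incorrect <- 5
n_no_match  <- 5

match_rate        <- (n_correct + n_incorrect) / n_inputs    # inputs with any candidate
accuracy_if_found <- n_correct / (n_correct + n_incorrect)   # correct among returned top matches
overall_accuracy  <- n_correct / n_inputs                    # unmatched inputs count as misses

sprintf("%.1f%%", 100 * c(match_rate, accuracy_if_found, overall_accuracy))
# "90.0%" "88.9%" "80.0%"
```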

5.4 Visualize Match Quality

Plot Jaro-Winkler score distribution
if (nrow(res) > 0) {
  best <- res |>
    group_by(input_id) |>
    slice(1) |>
    ungroup()
  
  # Histogram of Jaro-Winkler scores
  ggplot(best, aes(x = jaro_winkler)) +
    geom_histogram(bins = 20, fill = "#2C3E50", color = "white", alpha = 0.8) +
    geom_vline(xintercept = 0.90, linetype = "dashed", color = "#E74C3C", linewidth = 1) +
    annotate("text", x = 0.90, y = Inf, label = "Strict threshold (0.90)", 
             vjust = 2, hjust = -0.1, color = "#E74C3C") +
    labs(
      title = "Distribution of Jaro-Winkler Similarity Scores",
      subtitle = paste0("Best matches (n = ", nrow(best), ")"),
      x = "Jaro-Winkler Similarity Score",
      y = "Count"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(face = "bold", size = 14),
      plot.subtitle = element_text(color = "gray40")
    )
}

Plot match rank distribution
if (nrow(res) > 0) {
  rank_counts <- best |>
    count(rank_label, sort = TRUE)
  
  ggplot(rank_counts, aes(x = reorder(rank_label, n), y = n)) +
    geom_col(fill = "#3498DB", alpha = 0.8) +
    geom_text(aes(label = n), hjust = -0.2, size = 4) +
    coord_flip() +
    labs(
      title = "Match Quality Categories",
      subtitle = "Distribution of best matches across quality tiers",
      x = NULL,
      y = "Number of Matches"
    ) +
    theme_minimal() +
    theme(
      plot.title = element_text(face = "bold", size = 14),
      panel.grid.major.y = element_blank()
    )
}

5.5 Sample of Best Matches

Display top matches with details
if (nrow(res) > 0) {
  best |>
    select(
      input_id,
      first_norm,
      last_norm,
      npi,
      jaro_winkler,
      exact_name,
      rank_label,
      city_match
    ) |>
    arrange(desc(jaro_winkler)) |>
    head(20) |>
    knitr::kable(
      digits = 4,
      caption = "Top 20 Best Matches (by Jaro-Winkler Score)",
      col.names = c("ID", "First", "Last", "NPI", "JW Score", "Exact?", "Quality", "City?")
    )
}
Top 20 Best Matches (by Jaro-Winkler Score)
ID | First | Last | NPI | JW Score | Exact? | Quality | City?
2 | JOHN | BURKE | 1003039157 | 1 | TRUE | Extremely Likely Match | TRUE
4 | HANA | DANDONA | 1003008038 | 1 | TRUE | Very Likely Match | FALSE
6 | AMANDA | WINTERS | 1003020496 | 1 | TRUE | Very Likely Match | FALSE
7 | JASON | TROJACEK | 1003016510 | 1 | TRUE | Extremely Likely Match | TRUE
8 | JACLYN | JONES | 1003001785 | 1 | TRUE | Extremely Likely Match | TRUE
9 | ALFREDO | HERNANDEZ | 1295856193 | 1 | TRUE | Extremely Likely Match | TRUE
11 | SHIRISH | SATPUTE | 1003041443 | 1 | TRUE | Extremely Likely Match | TRUE
12 | RAVISHANKAR | RAMASWAMY | 1003010406 | 1 | TRUE | Extremely Likely Match | TRUE
13 | MOUSTAFA | BANNA | 1003009861 | 1 | TRUE | Very Likely Match | FALSE
14 | MICHAEL | GARGER | 1003032616 | 1 | TRUE | Extremely Likely Match | TRUE
15 | EMILY | LEASURE | 1003005836 | 1 | TRUE | Extremely Likely Match | TRUE
17 | ERIC | SCHENFELD | 1003042375 | 1 | TRUE | Very Likely Match | FALSE
18 | TINA | MONTEMURNO | 1003007360 | 1 | TRUE | Very Likely Match | FALSE
19 | SAMATHA | KADIYALA | 1003002627 | 1 | TRUE | Very Likely Match | FALSE
20 | MICHAEL | HOLLERBACH | 1003028663 | 1 | TRUE | Extremely Likely Match | TRUE
21 | GARY | GALLO | 1003021825 | 1 | TRUE | Extremely Likely Match | TRUE
22 | JON | LEPLEY | 1003010133 | 1 | TRUE | Very Likely Match | FALSE
23 | DANIEL | HAMMAN | 1003002890 | 1 | TRUE | Very Likely Match | FALSE
24 | CHARLES | CLAIR | 1003012204 | 1 | TRUE | Extremely Likely Match | TRUE
25 | NICHOLAS | DEFILIPPIS | 1003010661 | 1 | TRUE | Very Likely Match | FALSE

6 Taxonomy Code Translation

6.1 Understanding Provider Taxonomy Codes

One of the key features of tidynpi is the ability to translate cryptic NUCC taxonomy codes into human-readable descriptions. The enrichment results include columns like tax_code_1, tax_code_2, etc., which represent the provider’s specialties and classifications.

What are Taxonomy Codes?

The National Uniform Claim Committee (NUCC) maintains a Healthcare Provider Taxonomy code set that categorizes the type, classification, and specialization of healthcare providers. Each 10-character code represents a specific provider role.

Example: 207Q00000X = Family Medicine Physician
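
A quick format check can catch malformed codes before any lookup. The pattern below (10 characters, digits or uppercase letters, ending in "X") is inferred from the codes shown in this document; it is a shape check only and does not confirm that a code exists in the NUCC code set. The is_taxonomy_code() helper is hypothetical.

```r
# TRUE when a string has the shape of a taxonomy code as seen in this document:
# 10 characters, digits or uppercase letters, ending in "X".
is_taxonomy_code <- function(x) {
  !is.na(x) & grepl("^[0-9A-Z]{9}X$", x)
}

is_taxonomy_code(c("207Q00000X", "2084P0800X", "207q00000x", "207Q0000X", NA))
# TRUE TRUE FALSE FALSE FALSE
```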

6.2 Load Taxonomy Dictionary

The tnp_taxonomy_dict() function downloads and caches the official NUCC taxonomy dictionary:

Download and cache taxonomy dictionary
cat("Loading NUCC taxonomy dictionary...\n")
Loading NUCC taxonomy dictionary...
tax_dict <- tnp_taxonomy_dict()

cat("Loaded", nrow(tax_dict), "taxonomy codes\n\n")
Loaded 883 taxonomy codes
# Preview the dictionary structure
head(tax_dict, 10) |>
  select(code, grouping, classification, specialization, display_name) |>
  knitr::kable(caption = "Sample Taxonomy Dictionary Entries")
Sample Taxonomy Dictionary Entries
code | grouping | classification | specialization | display_name
193200000X | Group | Multi-Specialty |  | Multi-Specialty Group
193400000X | Group | Single Specialty |  | Single Specialty Group
207K00000X | Allopathic & Osteopathic Physicians | Allergy & Immunology |  | Allergy & Immunology Allopathic & Osteopathic Physicians
207KA0200X | Allopathic & Osteopathic Physicians | Allergy & Immunology | Allergy | Allergy (Allergy & Immunology)
207KI0005X | Allopathic & Osteopathic Physicians | Allergy & Immunology | Clinical & Laboratory Immunology | Clinical & Laboratory Immunology (Allergy & Immunology)
207L00000X | Allopathic & Osteopathic Physicians | Anesthesiology |  | Anesthesiology Allopathic & Osteopathic Physicians
207LA0401X | Allopathic & Osteopathic Physicians | Anesthesiology | Addiction Medicine | Addiction Medicine (Anesthesiology)
207LC0200X | Allopathic & Osteopathic Physicians | Anesthesiology | Critical Care Medicine | Critical Care Medicine (Anesthesiology)
207LH0002X | Allopathic & Osteopathic Physicians | Anesthesiology | Hospice and Palliative Medicine | Hospice and Palliative Medicine (Anesthesiology)
207LP2900X | Allopathic & Osteopathic Physicians | Anesthesiology | Pain Medicine | Pain Medicine (Anesthesiology)

6.3 Translate Taxonomy Codes in Results

Use tnp_taxonomy_translate() to automatically add human-readable columns for all taxonomy codes:

Add taxonomy translations to results
if (nrow(res) > 0) {
  # Add taxonomy translations to the results
  res_with_tax <- tnp_taxonomy_translate(res)
  
  cat("Added taxonomy translation columns\n")
  cat("New columns added:", 
      paste(grep("^tax_.*_(classification|specialization|display_name)$", 
                 names(res_with_tax), value = TRUE), collapse = ", "), "\n\n")
  
  # Get best matches with taxonomy info
  best_with_tax <- res_with_tax |>
    group_by(input_id) |>
    slice(1) |>
    ungroup()
  
  # Show examples with taxonomy translations
  best_with_tax |>
    select(
      input_id,
      first_norm,
      last_norm,
      npi,
      tax_code_1,
      tax_tax_code_1_display_name,
      tax_code_2,
      tax_tax_code_2_display_name
    ) |>
    head(15) |>
    knitr::kable(
      caption = "Provider Matches with Taxonomy Translations",
      col.names = c("ID", "First", "Last", "NPI", 
                    "Tax Code 1", "Primary Specialty", 
                    "Tax Code 2", "Secondary Specialty")
    )
}
Added taxonomy translation columns
New columns added: tax_tax_code_1_classification, tax_tax_code_1_specialization, tax_tax_code_1_display_name, tax_tax_code_2_classification, tax_tax_code_2_specialization, tax_tax_code_2_display_name, tax_tax_code_3_classification, tax_tax_code_3_specialization, tax_tax_code_3_display_name, tax_tax_code_4_classification, tax_tax_code_4_specialization, tax_tax_code_4_display_name, tax_tax_code_5_classification, tax_tax_code_5_specialization, tax_tax_code_5_display_name 
Provider Matches with Taxonomy Translations
ID | First | Last | NPI | Tax Code 1 | Primary Specialty | Tax Code 2 | Secondary Specialty
2 | JOHN | BURKE | 1003039157 | 231H00000X | Audiologist Speech, Language and Hearing Service Providers | NA | NA
3 | MICHAEL | CONNOLLY | 1285953091 | 101Y00000X | Counselor Behavioral Health & Social Service Providers | NA | NA
4 | HANA | DANDONA | 1003008038 | 363AM0700X | Medical (Physician Assistant) | NA | NA
6 | AMANDA | WINTERS | 1003020496 | 2084P0015X | Psychosomatic Medicine (Psychiatry & Neurology) | 2084P0015X | Psychosomatic Medicine (Psychiatry & Neurology)
7 | JASON | TROJACEK | 1003016510 | 225100000X | Physical Therapist Respiratory, Developmental, Rehabilitative and Restorative Service Providers | 2251P0200X | Pediatrics (Physical Therapist)
8 | JACLYN | JONES | 1003001785 | 207X00000X | Orthopaedic Surgery Allopathic & Osteopathic Physicians | NA | NA
9 | ALFREDO | HERNANDEZ | 1295856193 | 174400000X | Specialist Other Service Providers | NA | NA
10 | JAIME | MICHAELSON | 1639239015 | 2084A0401X | Addiction Medicine (Psychiatry & Neurology) | 2084P0800X | Psychiatry (Psychiatry & Neurology)
11 | SHIRISH | SATPUTE | 1003041443 | 2084N0400X | Neurology (Psychiatry & Neurology) | 2084N0600X | Clinical Neurophysiology (Psychiatry & Neurology)
12 | RAVISHANKAR | RAMASWAMY | 1003010406 | 207RG0300X | Geriatric Medicine (Internal Medicine) | 207QG0300X | Geriatric Medicine (Family Medicine)
13 | MOUSTAFA | BANNA | 1003009861 | 207RC0000X | Cardiovascular Disease (Internal Medicine) | NA | NA
14 | MICHAEL | GARGER | 1003032616 | 111N00000X | Chiropractor Chiropractic Providers | NA | NA
15 | EMILY | LEASURE | 1003005836 | 207R00000X | Internal Medicine Allopathic & Osteopathic Physicians | 207R00000X | Internal Medicine Allopathic & Osteopathic Physicians
17 | ERIC | SCHENFELD | 1003042375 | 207P00000X | Emergency Medicine Allopathic & Osteopathic Physicians | 207P00000X | Emergency Medicine Allopathic & Osteopathic Physicians
18 | TINA | MONTEMURNO | 1003007360 | 207L00000X | Anesthesiology Allopathic & Osteopathic Physicians | 207L00000X | Anesthesiology Allopathic & Osteopathic Physicians
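
Conceptually, the translation is a dictionary lookup applied to each taxonomy code column. The base-R sketch below shows that idea with a two-row stand-in dictionary whose codes and display names are copied from tables in this document. The translate_codes() helper and the made-up code "000000000X" are hypothetical, and the sketch only adds the display_name columns; it does not reproduce tnp_taxonomy_translate().

```r
# Tiny stand-in dictionary; codes and display names copied from tables in this document.
mini_dict <- data.frame(
  code = c("207R00000X", "2084P0800X"),
  display_name = c("Internal Medicine Allopathic & Osteopathic Physicians",
                   "Psychiatry (Psychiatry & Neurology)"),
  stringsAsFactors = FALSE
)

# Add one "tax_<column>_display_name" column per taxonomy code column via a lookup.
translate_codes <- function(df, dict, code_cols) {
  for (col in code_cols) {
    idx <- match(df[[col]], dict$code)   # NA where the code is missing or unknown
    df[[paste0("tax_", col, "_display_name")]] <- dict$display_name[idx]
  }
  df
}

providers <- data.frame(
  tax_code_1 = c("207R00000X", "2084P0800X", "000000000X"),  # last code is made up
  tax_code_2 = c(NA, "207R00000X", NA),
  stringsAsFactors = FALSE
)
out <- translate_codes(providers, mini_dict, c("tax_code_1", "tax_code_2"))
out$tax_tax_code_1_display_name
```

Unknown or missing codes come back as NA, which matches the NA entries in the Secondary Specialty column of the table above.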

6.4 Analyze Provider Specialties

Summarize provider specialties
if (nrow(res) > 0 && exists("best_with_tax")) {
  # Count primary specialties
  cat("========================================\n")
  cat("PRIMARY SPECIALTY DISTRIBUTION\n")
  cat("========================================\n\n")
  
  specialty_counts <- best_with_tax |>
    filter(!is.na(tax_tax_code_1_display_name)) |>
    count(tax_tax_code_1_display_name, sort = TRUE, name = "Count") |>
    head(15)
  
  # Note: Percentage is relative to the top-15 rows shown here,
  # not to all matched providers
  specialty_counts |>
    mutate(Percentage = sprintf("%.1f%%", Count / sum(Count) * 100)) |>
    knitr::kable(
      caption = "Top 15 Primary Specialties in Matched Providers",
      col.names = c("Primary Specialty", "Count", "Percentage")
    )
}
========================================
PRIMARY SPECIALTY DISTRIBUTION
========================================
Top 15 Primary Specialties in Matched Providers
Primary Specialty | Count | Percentage
Internal Medicine Allopathic & Osteopathic Physicians | 4 | 14.8%
Chiropractor Chiropractic Providers | 3 | 11.1%
Orthopaedic Surgery Allopathic & Osteopathic Physicians | 3 | 11.1%
Anesthesiology Allopathic & Osteopathic Physicians | 2 | 7.4%
Clinical (Social Worker) | 2 | 7.4%
Counselor Behavioral Health & Social Service Providers | 2 | 7.4%
Family Medicine Allopathic & Osteopathic Physicians | 2 | 7.4%
Neurology (Psychiatry & Neurology) | 2 | 7.4%
Addiction Medicine (Psychiatry & Neurology) | 1 | 3.7%
Adult Medicine (Family Medicine) | 1 | 3.7%
Audiologist Speech, Language and Hearing Service Providers | 1 | 3.7%
Cardiovascular Disease (Internal Medicine) | 1 | 3.7%
Dermatology Allopathic & Osteopathic Physicians | 1 | 3.7%
Diagnostic Radiology (Radiology) | 1 | 3.7%
Emergency Medicine Allopathic & Osteopathic Physicians | 1 | 3.7%

6.5 Visualize Specialty Distribution

Plot provider specialty distribution
if (nrow(res) > 0 && exists("best_with_tax")) {
  top_specialties <- best_with_tax |>
    filter(!is.na(tax_tax_code_1_display_name)) |>
    count(tax_tax_code_1_display_name, sort = TRUE) |>
    head(10)
  
  if (nrow(top_specialties) > 0) {
    ggplot(top_specialties, aes(x = reorder(tax_tax_code_1_display_name, n), y = n)) +
      geom_col(fill = "#27AE60", alpha = 0.8) +
      geom_text(aes(label = n), hjust = -0.2, size = 3.5) +
      coord_flip() +
      labs(
        title = "Top 10 Provider Specialties",
        subtitle = "Based on primary taxonomy codes in matched results",
        x = NULL,
        y = "Number of Providers"
      ) +
      theme_minimal() +
      theme(
        plot.title = element_text(face = "bold", size = 14),
        panel.grid.major.y = element_blank(),
        axis.text.y = element_text(size = 10)
      )
  }
}

6.6 Example: Lookup Individual Taxonomy Codes

You can also look up specific taxonomy codes directly:

Look up specific taxonomy codes
# Example taxonomy codes to look up
example_codes <- c(
  "207Q00000X",  # Family Medicine
  "208D00000X",  # General Practice
  "363L00000X",  # Nurse Practitioner
  "207R00000X",  # Internal Medicine
  "2084P0800X"   # Psychiatry & Neurology - Psychiatry
)

# Look up in dictionary
tax_dict |>
  filter(code %in% example_codes) |>
  select(code, grouping, classification, specialization, display_name) |>
  knitr::kable(
    caption = "Example Taxonomy Code Lookups",
    col.names = c("Code", "Grouping", "Classification", "Specialization", "Display Name")
  )
Example Taxonomy Code Lookups
| Code | Grouping | Classification | Specialization | Display Name |
|---|---|---|---|---|
| 207Q00000X | Allopathic & Osteopathic Physicians | Family Medicine | | Family Medicine Allopathic & Osteopathic Physicians |
| 208D00000X | Allopathic & Osteopathic Physicians | General Practice | | General Practice Allopathic & Osteopathic Physicians |
| 207R00000X | Allopathic & Osteopathic Physicians | Internal Medicine | | Internal Medicine Allopathic & Osteopathic Physicians |
| 2084P0800X | Allopathic & Osteopathic Physicians | Psychiatry & Neurology | Psychiatry | Psychiatry (Psychiatry & Neurology) |
| 363L00000X | Physician Assistants & Advanced Practice Nursing Providers | Nurse Practitioner | | Nurse Practitioner Physician Assistants & Advanced Practice Nursing Providers |
Why Taxonomy Translation Matters

Without translation: tax_code_1 = "207Q00000X"
With translation: Primary Specialty = "Family Medicine Allopathic & Osteopathic Physicians"

This makes it much easier to:

  • Verify you matched the correct provider type
  • Filter results by specialty
  • Generate summary reports for stakeholders
  • Validate data linkage quality
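
As a minimal sketch of what translation looks like in practice, the base R snippet below attaches readable labels to raw codes and then filters on them. It assumes a dictionary with the code and display_name columns used by tax_dict; the two dictionary rows are copied from the lookup table, while toy_dict, toy_matches, the npi values, and the primary_specialty column name are hypothetical placeholders.

```r
# Toy dictionary using the same column names as tax_dict (code, display_name).
# The two rows are copied from the lookup table; the real dictionary is much larger.
toy_dict <- data.frame(
  code = c("207Q00000X", "2084P0800X"),
  display_name = c(
    "Family Medicine Allopathic & Osteopathic Physicians",
    "Psychiatry (Psychiatry & Neurology)"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical match results: the npi values are placeholders, not real providers,
# and "999X00000X" is a made-up code that is absent from the dictionary.
toy_matches <- data.frame(
  npi = c("0000000001", "0000000002", "0000000003"),
  tax_code_1 = c("207Q00000X", "2084P0800X", "999X00000X"),
  stringsAsFactors = FALSE
)

# match() returns NA for codes missing from the dictionary,
# so untranslated codes stay visible instead of being dropped.
toy_matches$primary_specialty <- toy_dict$display_name[
  match(toy_matches$tax_code_1, toy_dict$code)
]

# Filter by specialty using the readable label rather than the raw code.
psychiatry_only <- toy_matches[grepl("Psychiatry", toy_matches$primary_specialty), ]

print(toy_matches)
```

Rows whose code is not in the dictionary keep an NA label, which is a quick way to spot codes that need a dictionary update.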

7 Individual Function Examples

7.1 Example 1: Direct npi_normalize() Usage

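npi_normalize() converts names, states, and cities into a search-ready format. In the output below, the full state name "California" ends up in the INTL partition rather than CA, which suggests that states should be supplied as two-letter abbreviations: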

Normalize individual records
# Create a small test dataset
test_inputs <- data.frame(
  full_name = c("Smith, John", "O'Connor, Mary", "de la Cruz, Carlos"),
  state = c("NY", "California", "TX"),
  city = c("New York", "Los Angeles", "Houston"),
  stringsAsFactors = FALSE
)

# Normalize the inputs
normalized <- npi_normalize(
  inputs = test_inputs,
  full_name = "full_name",
  state = "state",
  city = "city"
)

normalized |>
  select(input_id, first_norm, last_norm, state_part, lname_initial, city_norm) |>
  knitr::kable(caption = "Normalized Inputs")
Normalized Inputs
| input_id | first_norm | last_norm | state_part | lname_initial | city_norm |
|---|---|---|---|---|---|
| 1 | JOHN | SMITH | NY | S | NEW YORK |
| 2 | MARY | OCONNOR | INTL | O | LOS ANGELES |
| 3 | CARLOS | DE LA CRUZ | TX | D | HOUSTON |
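
The output above places "California" in the INTL partition, so records with spelled-out state names would probably be searched in the wrong shards. One possible workaround is to convert full names to two-letter codes before calling npi_normalize(). The helper below is a hypothetical sketch (to_state_abb is not part of tidynpi) built on base R's state.name and state.abb vectors, which cover the 50 states only.

```r
# Hypothetical pre-processing helper, not part of tidynpi.
# Full US state names become two-letter codes; any other value is only
# trimmed and uppercased. state.name and state.abb ship with base R and cover
# the 50 states, so DC, territories, and non-US values pass through as given.
to_state_abb <- function(x) {
  x_trim <- trimws(x)
  idx <- match(tolower(x_trim), tolower(state.name))
  ifelse(is.na(idx), toupper(x_trim), state.abb[idx])
}

to_state_abb(c("NY", "California", "TX"))
# [1] "NY" "CA" "TX"
```

Applying it to the state column of test_inputs before normalization should keep the second record with the other California providers, assuming the package maps "CA" to its own partition as it does for "NY" and "TX".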

7.2 Example 2: Understanding Partitioning Keys

The state_part and lname_initial columns are blocking keys used to select relevant parquet shards:

Show partitioning logic
# Show how names map to partitioning keys
examples <- data.frame(
  Name = c("Smith", "Zhang", "O'Brien", "#Unknown", "123Test"),
  Initial = substr(c("SMITH", "ZHANG", "OBRIEN", "#UNKNOWN", "123TEST"), 1, 1),
  Encoded = c("S", "Z", "O", "%23", "1"),
  stringsAsFactors = FALSE
)

examples |>
  knitr::kable(caption = "Last Name Initial Encoding Examples")
Last Name Initial Encoding Examples
| Name | Initial | Encoded |
|---|---|---|
| Smith | S | S |
| Zhang | Z | Z |
| O’Brien | O | O |
| #Unknown | # | %23 |
| 123Test | 1 | 1 |
Partitioning Strategy
  • State: Each US state + DC + territories + INTL catch-all
  • Last Name Initial: A–Z, 0–9, plus encoded specials (%23 for #, %2F for /, _ for other)

This yields roughly 1,500 or more shards (about 50 state partitions × about 30 initials), so each query reads only the shards that match its inputs.
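
The encoding rule for the last-name initial can be sketched in a few lines of base R. The encode_initial helper below is hypothetical and only mirrors the scheme listed above (A–Z and 0–9 kept as-is, %23 for #, %2F for /, _ for anything else); the package's internal implementation may differ.

```r
# Hypothetical helper, not part of tidynpi: map a last name to its partition
# initial following the scheme described above.
encode_initial <- function(last_name) {
  first_char <- toupper(substr(trimws(last_name), 1, 1))
  out <- rep("_", length(first_char))                      # default: catch-all
  is_alnum <- first_char %in% c(LETTERS, as.character(0:9))
  out[is_alnum] <- first_char[is_alnum]                    # A-Z and 0-9 map to themselves
  out[first_char %in% "#"] <- "%23"                        # URL-encoded '#'
  out[first_char %in% "/"] <- "%2F"                        # URL-encoded '/'
  out
}

encode_initial(c("Smith", "Zhang", "O'Brien", "#Unknown", "123Test", "/Slash", "-Dash"))
# [1] "S"   "Z"   "O"   "%23" "1"   "%2F" "_"

# Upper bound on distinct initials under this scheme:
# 26 letters + 10 digits + 3 special buckets
n_initials <- 26 + 10 + 3
n_initials
# [1] 39
```

Not every state and initial combination necessarily exists as a shard, so the product of state partitions and initials is an upper bound rather than an exact count.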

8 Performance Tips

  1. Adjust max_urls_per_query
    • Increase to 3–5 for faster batch processing
    • Decrease to 1 if hitting rate limits
  2. Use city_mode = "ignore" when possible
    • Simplifies matching logic
    • City data can be noisy/incomplete
  3. Choose strategy wisely
    • Use "strict" for high-precision applications
    • Use "loose" when you need higher recall
  4. Limit max_candidates
    • Reduces result size and processing time
    • Set to 1 if you only need the best match
  5. Sample inputs for testing
    • Start with 10–50 records during development
    • Scale up once parameters are optimized
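
For tip 5, a reproducible development sample can be drawn with base R alone. This is a minimal sketch: all_inputs is a made-up placeholder table, and the sample size of 25 is just one value inside the suggested 10–50 range.

```r
# Placeholder input table; the names are invented for illustration only.
all_inputs <- data.frame(
  full_name = paste0("Surname", 1:200, ", Given", 1:200),
  state = rep(c("NY", "TX", "CA", "FL"), length.out = 200),
  stringsAsFactors = FALSE
)

set.seed(42)  # fixed seed so the same development sample is drawn on every run
dev_n <- min(25, nrow(all_inputs))  # never ask for more rows than exist
dev_inputs <- all_inputs[sample(nrow(all_inputs), dev_n), , drop = FALSE]

nrow(dev_inputs)
# [1] 25
```

Once the matching parameters look right on dev_inputs, the same call can be rerun on the full table.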

9 Troubleshooting

9.0.1 No matches returned

  • Check manifest path: Ensure the path exists and is valid JSON
  • Verify parquet availability: Check that the URLs in the manifest are accessible
  • Inspect normalized inputs: Run npi_normalize() separately to verify state/name parsing

9.0.2 HTTP 429 errors (rate limiting)

  • Increase initial_wait and max_wait
  • Decrease max_urls_per_query to 1
  • Add sleep_between_batches = 0.5
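
To get a feel for how initial_wait, max_wait, and tries interact, the sketch below computes a capped exponential backoff schedule. It assumes the wait doubles after each failed attempt and is capped at max_wait; backoff_schedule is a hypothetical helper, and tidynpi's actual retry schedule may differ.

```r
# Hypothetical helper: seconds to wait before each retry, doubling from
# initial_wait and capped at max_wait. This is an assumed schedule, not
# necessarily the one tidynpi uses internally.
backoff_schedule <- function(initial_wait, max_wait, tries) {
  pmin(initial_wait * 2^(seq_len(tries) - 1), max_wait)
}

backoff_schedule(initial_wait = 1, max_wait = 8, tries = 5)
# [1] 1 2 4 8 8

# Total time spent waiting in the worst case
sum(backoff_schedule(initial_wait = 1, max_wait = 8, tries = 5))
# [1] 23
```

Under this assumption, raising max_wait only matters once the doubled wait exceeds the cap, while raising initial_wait slows down every retry.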

9.0.3 Low match quality

  • Try strategy = "loose" for more lenient matching
  • Check input data quality (typos, formatting issues)
  • Use city_mode = "ignore" if city data is unreliable

9.0.4 Slow performance

  • Increase max_urls_per_query to batch more URLs
  • Reduce tries to 2 if network is stable
  • Limit sample size during testing

10 Cleanup and Export

10.1 Export Results

Save results to CSV files
# Save results to CSV (runs only when matches were found)
if (nrow(res) > 0) {
  # Keep the first candidate per input; this assumes res is already ordered best-first
  best <- res |>
    group_by(input_id) |>
    slice(1) |>
    ungroup()
  
  # Export all candidates
  write_csv(res, "tidynpi_demo_results_all.csv")
  
  # Export best matches only
  write_csv(best, "tidynpi_demo_results_best.csv")
  
  cat("Results exported to CSV files\n")
}

11 Session Info

Display R session information
sessionInfo()
R version 4.5.0 (2025-04-11 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] knitr_1.50         ggplot2_4.0.0      DBI_1.2.3          jsonlite_2.0.0     readr_2.1.5        dplyr_1.1.4       
[7] tidynpi_0.0.0.9000

loaded via a namespace (and not attached):
 [1] bit_4.6.0          gtable_0.3.6       crayon_1.5.3       compiler_4.5.0     tidyselect_1.2.1   parallel_4.5.0    
 [7] scales_1.4.0       yaml_2.3.10        fastmap_1.2.0      R6_2.6.1           labeling_0.4.3     generics_0.1.4    
[13] curl_7.0.0         httr2_1.2.2        htmlwidgets_1.6.4  tibble_3.2.1       pillar_1.11.1      RColorBrewer_1.1-3
[19] tzdb_0.5.0         rlang_1.1.6        stringi_1.8.7      xfun_0.52          S7_0.2.0           bit64_4.6.0-1     
[25] cli_3.6.5          withr_3.0.2        magrittr_2.0.3     stringdist_0.9.15  digest_0.6.37      grid_4.5.0        
[31] vroom_1.6.5        rstudioapi_0.17.1  rappdirs_0.3.3     hms_1.1.3          lifecycle_1.0.4    vctrs_0.6.5       
[37] evaluate_1.0.3     glue_1.8.0         duckdb_1.4.3       farver_2.1.2       rmarkdown_2.29     tools_4.5.0       
[43] pkgconfig_2.0.3    htmltools_0.5.8.1 

Next Steps
  • Try different strategy and city_mode combinations
  • Test with your own input data
  • Explore the full NPPES dataset (millions of providers!)
  • Check out the tidynpi documentation for more details