Introduction

I see firsthand how patient care is determined by the where, how, and who. I want to investigate the reality of healthcare accessibility, specifically by identifying where administrative gaps in provider distribution create “medical deserts.” The goal of this project is to move beyond spreadsheets and visualize how these geographical barriers correlate with community health outcomes. I want to find out whether we are placing our resources where they are needed most, or just where infrastructure already exists.

Data Acquisition

In this stage, I acquire data from two distinct sources to provide a comprehensive view of healthcare access. The first is the CDC Social Vulnerability Index (SVI), which provides socio-economic data by county. The second is the CMS National Provider Identifier (NPI) Registry API, which lists where healthcare providers are actually located. Because network access to both sources was blocked during development, the code in this section loads a small simulation of each dataset (10 New York counties and 10 providers) in place of the live data.
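As a rough sketch of how a live NPI Registry response could be turned into the same shape as the simulated provider table, the example below parses a small JSON string instead of calling the API. The JSON here is hand-written for illustration, and its field names (`results`, `number`, `basic`) are assumptions modeled on the simulated data in this report, so they should be checked against the registry's documentation before use.

```r
library(jsonlite)
library(purrr)
library(tibble)

# Hypothetical sample shaped like an NPI Registry response.
# Hand-written for illustration; not real API output.
sample_json <- '{
  "result_count": 2,
  "results": [
    {"number": "1001", "basic": {"organization_name": "Bronx Health Center"}},
    {"number": "1003", "basic": {"first_name": "John", "last_name": "Smith"}}
  ]
}'

# simplifyVector = FALSE keeps the nested structure as plain R lists,
# which matches the list-column layout used later in the cleaning step.
parsed <- fromJSON(sample_json, simplifyVector = FALSE)

providers_sample <- tibble(
  number = map_chr(parsed$results, "number"),  # one ID per provider
  basic  = map(parsed$results, "basic")        # nested name fields
)

print(providers_sample)
```

Keeping `basic` as a list column means the same name-extraction logic can run on either the simulation or a parsed response.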

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(httr)
library(jsonlite)

## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:purrr':
## 
##     flatten

library(leaflet)

# --- EMERGENCY DATA LOAD (BYPASSING NETWORK BLOCKS) ---

# This simulates the CDC SVI Data for 10 NY Counties
svi_raw <- tibble(
  ST_ABBR = "NY",
  COUNTY = c("Bronx", "Kings", "New York", "Queens", "Richmond", "Erie", "Monroe", "Albany", "Suffolk", "Nassau"),
  FIPS = c("36005", "36047", "36061", "36081", "36085", "36029", "36055", "36001", "36103", "36059"),
  RPL_THEMES = c(0.98, 0.85, 0.45, 0.65, 0.35, 0.72, 0.55, 0.40, 0.30, 0.25)
)

# This simulates the NPI Provider Data
providers_raw <- tibble(
  number = 1001:1010,
  basic = list(
    list(organization_name = "Bronx Health Center"),
    list(organization_name = "Brooklyn Community Clinic"),
    list(first_name = "John", last_name = "Smith"), # Individual
    list(organization_name = "Queens General"),
    list(organization_name = "Richmond Medical"),
    list(first_name = "Jane", last_name = "Doe"),   # Individual
    list(organization_name = "Buffalo General"),
    list(organization_name = "Rochester Health"),
    list(organization_name = "Albany Med"),
    list(organization_name = "Long Island Care")
  ),
  enumeration_date = rep("2023-01-01", 10)
)

print("Simulation Loaded! You can now proceed to Data Cleaning.")

## [1] "Simulation Loaded! You can now proceed to Data Cleaning."
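Once the real SVI file is reachable, two details are worth guarding against when reading it: FIPS codes can start with a zero, and (as I understand the SVI documentation) -999 is used as a missing-value marker. The sketch below reads an inline CSV rather than the real download. The Alabama row and its -999 score are made up to show both cases, and the -999 convention should be confirmed against the SVI data dictionary.

```r
library(readr)
library(dplyr)

# Inline stand-in for the SVI CSV; values are illustrative only.
svi_csv <- I("ST_ABBR,COUNTY,FIPS,RPL_THEMES
AL,Autauga,01001,-999
NY,Bronx,36005,0.98
NY,Kings,36047,0.85
")

svi_demo <- read_csv(
  svi_csv,
  col_types = cols(
    ST_ABBR    = col_character(),
    COUNTY     = col_character(),
    FIPS       = col_character(),  # keep as text so "01001" keeps its leading zero
    RPL_THEMES = col_double()
  ),
  na = c("", "NA", "-999")         # assumed missing-value marker
)

# Keep New York rows that have a usable vulnerability score
svi_ny_demo <- svi_demo %>%
  filter(ST_ABBR == "NY", !is.na(RPL_THEMES))

print(svi_ny_demo)
```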

Verification

print("Data Load Complete!")

Data Cleaning & Transformation

Raw data is rarely ready for analysis. Here, I filter the CDC data for New York and transform the vulnerability scores into categorical risk levels: High for an index of 0.75 or above, Medium for 0.50 up to 0.75, and Low for anything below 0.50. I also unnest the provider data, which arrives from the API as a nested list, to make it usable for mapping.

# 1. Clean the SVI Data
svi_ny <- svi_raw %>%
  filter(ST_ABBR == "NY") %>%
  rename(vulnerability_index = RPL_THEMES) %>%
  mutate(risk_level = case_when(
    vulnerability_index >= 0.75 ~ "High Vulnerability",
    vulnerability_index >= 0.50 ~ "Medium Vulnerability",
    TRUE ~ "Low Vulnerability"
  ))

# 2. Clean the Provider Data
providers_clean <- providers_raw %>%
  mutate(provider_name = map_chr(basic, ~ {
    if (!is.null(.x$organization_name)) {
      return(.x$organization_name)
    } else {
      return(paste(.x$first_name, .x$last_name))
    }
  })) %>%
  select(number, provider_name, enumeration_date)

# 3. View the results
print("Cleaned SVI Data:")

## [1] "Cleaned SVI Data:"

head(svi_ny)

## # A tibble: 6 × 5
##   ST_ABBR COUNTY   FIPS  vulnerability_index risk_level          
##   <chr>   <chr>    <chr>               <dbl> <chr>               
## 1 NY      Bronx    36005                0.98 High Vulnerability  
## 2 NY      Kings    36047                0.85 High Vulnerability  
## 3 NY      New York 36061                0.45 Low Vulnerability   
## 4 NY      Queens   36081                0.65 Medium Vulnerability
## 5 NY      Richmond 36085                0.35 Low Vulnerability   
## 6 NY      Erie     36029                0.72 Medium Vulnerability
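The provider cleaning step assumes every record carries either an organization name or a first and last name. A slightly more defensive variant is sketched below. `provider_label` is a hypothetical helper, not part of the analysis above, and it returns `NA` when a record has no usable name fields instead of failing inside `map_chr`.

```r
library(purrr)

# Hypothetical helper: build a display name from one nested `basic` record.
provider_label <- function(basic) {
  org <- pluck(basic, "organization_name")
  if (!is.null(org)) {
    return(org)
  }
  # c() drops NULL fields, so a missing first or last name is simply skipped
  person <- paste(c(basic$first_name, basic$last_name), collapse = " ")
  if (nzchar(person)) person else NA_character_
}

records <- list(
  list(organization_name = "Bronx Health Center"),
  list(first_name = "John", last_name = "Smith"),
  list()  # a record with no name fields at all
)

labels <- map_chr(records, provider_label)
print(labels)
```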

Analysis & Visualization

Here I will explore the relationship between the Social Vulnerability Index and the distribution of providers.

# Graphic 1: Distribution of Vulnerability in New York
# This works perfectly with our simulated data!
ggplot(svi_ny, aes(x = risk_level, fill = risk_level)) +
  geom_bar() +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "New York Counties by Vulnerability Level", 
       x = "Risk Category",
       y = "Number of Counties")

# Graphic 2: Supply vs Demand Analysis
# This creates a simple table to show the "Supply" (Providers) we found
print("Summary of Providers Identified in Simulation:")

## [1] "Summary of Providers Identified in Simulation:"

providers_clean %>% 
  select(provider_name) %>%
  rename(`Provider Name` = provider_name)

## # A tibble: 10 × 1
##    `Provider Name`          
##    <chr>                    
##  1 Bronx Health Center      
##  2 Brooklyn Community Clinic
##  3 John Smith               
##  4 Queens General           
##  5 Richmond Medical         
##  6 Jane Doe                 
##  7 Buffalo General          
##  8 Rochester Health         
##  9 Albany Med               
## 10 Long Island Care
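To compare supply with vulnerability directly, each provider would need to be tied to a county. The simulated provider table has no location field, so the sketch below adds a hypothetical `provider_fips` column. The county assignments are purely illustrative, and Nassau is deliberately left without a provider to show how a county with zero matches is handled.

```r
library(dplyr)
library(tibble)

svi_demo <- tibble(
  COUNTY     = c("Bronx", "Kings", "Queens", "Nassau"),
  FIPS       = c("36005", "36047", "36081", "36059"),
  risk_level = c("High Vulnerability", "High Vulnerability",
                 "Medium Vulnerability", "Low Vulnerability")
)

# Hypothetical: provider_fips is NOT in the simulated provider data.
providers_demo <- tibble(
  provider_name = c("Bronx Health Center", "Brooklyn Community Clinic",
                    "Queens General"),
  provider_fips = c("36005", "36047", "36081")
)

# Count providers per county, then attach the counts to every county.
supply_by_county <- svi_demo %>%
  left_join(
    count(providers_demo, provider_fips, name = "n_providers"),
    by = c("FIPS" = "provider_fips")
  ) %>%
  mutate(n_providers = coalesce(n_providers, 0L))  # no match means zero providers

# Roll the county counts up to the risk categories.
supply_by_risk <- supply_by_county %>%
  group_by(risk_level) %>%
  summarise(counties = n(), providers = sum(n_providers), .groups = "drop")

print(supply_by_county)
print(supply_by_risk)
```

Starting the join from the county table rather than the provider table keeps counties with no providers in the result, which is the case a "medical desert" analysis cares about most.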

Conclusion

This preliminary analysis of New York’s healthcare landscape highlights how sharply vulnerability differs from one county to the next. By combining the CDC Social Vulnerability Index with CMS NPI Registry data, we can begin to see where administrative and socio-economic barriers may hinder patient care.

Even within this simulated sample of 10 counties, vulnerability is unevenly distributed: two counties (Bronx and Kings) fall in the high-vulnerability category, three in the medium category, and five in the low category. The goal of this project, identifying “medical deserts,” requires a continuous feed of this data so that resources are allocated based on community need rather than just existing infrastructure. Future iterations of this project will integrate full-scale geospatial mapping to provide a real-time “heat map” of healthcare accessibility.
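As a starting point for that mapping work, the sketch below shows one way a county-level map could be assembled with leaflet. The coordinates are rough, hand-entered values for illustration only (the simulation contains no coordinate data), and the color choices are arbitrary.

```r
library(leaflet)
library(tibble)

# Rough, hand-entered coordinates; not authoritative county centroids.
map_demo <- tibble(
  COUNTY     = c("Bronx", "Kings", "Nassau"),
  lat        = c(40.85, 40.65, 40.73),
  lng        = c(-73.87, -73.95, -73.59),
  risk_level = c("High Vulnerability", "High Vulnerability", "Low Vulnerability")
)

# Fixed level order so each risk category always gets the same color.
risk_levels <- c("High Vulnerability", "Medium Vulnerability", "Low Vulnerability")
pal <- colorFactor(
  palette = c("red", "orange", "green"),
  domain  = NULL,
  levels  = risk_levels
)

vulnerability_map <- leaflet(map_demo) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~lng, lat = ~lat,
    color = ~pal(risk_level),
    label = ~COUNTY
  ) %>%
  addLegend(pal = pal, values = risk_levels, title = "Risk level")

vulnerability_map
```

Provider locations could later be added as a second marker layer once the data includes addresses or coordinates.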

AI Transcript

I utilized an AI collaborator (Gemini) to assist with technical troubleshooting and data acquisition hurdles.

Summary of AI Assistance:

Connection Troubleshooting: The AI assisted in identifying that a 404 error was occurring due to CDC server instability and local network firewall restrictions.

Data Simulation: Because direct API access was blocked by the local environment during development, the AI helped construct a “High-Fidelity Simulation” of the SVI and NPI datasets. This allowed me to build out the transformation and visualization logic without losing development time.

Code Optimization: The AI provided the map_chr logic used in the clean-data chunk to handle the complex, nested JSON list structure of the NPI Registry response—a key requirement for handling real-world API data.

Formatting: The AI suggested the use of ggplot2 for categorical visualization when geospatial mapping was limited by the lack of local coordinate data in the simulation.

MSDS 607 Final Project

Ciara Bonnett

2026-05-10