NYC Wastewater & COVID-19 Hospitalizations Analysis

Author

Desiree Thomas

Intentions

The primary objective of this project was to explore and visualize the “Lead Time” hypothesis. Epidemiological studies suggest that SARS-CoV-2 concentrations in municipal wastewater serve as a leading indicator, providing a 10-14 day predictive window for subsequent COVID-19 hospitalizations. By tracking these viral signals across New York City’s diverse boroughs, I aimed to determine whether this lead time remains consistent or if it is affected by socio-economic disparities.

Approach (SDD)

The data used can be found via the links in the Citations section. In this project, I pulled in data concerning the viral loads in the Wastewater and Hospitalizations in NYC via APIs, the Census data was gathered using the Tidycensus R package and the mapping of the Boroughs with the water included was pulled in via a .geojson file.

This project was separated into two parts: the initial work which included the gathering the data, developing an approach and then the ‘prepping’ of the data: (cleaning, transformations, checks that the data was loading correctly, etc) and lastly, spec driven development (utilized after the script was finished and the master_data.rds file was created).

There were four primary transformations used for this project. Dates were standardized as the raw data from different agencies were at risk of potentially using different date formats; this prevented errors from disrupting the time-series analysis chart. Second, I used a key ‘Borough’ as a location identifier for the five Boroughs in NYC; this allowed me to use the left_join that aids the maps interactive features. The raw wastewater was averaged to help smooth the line in the Plotly chart to help visually aid in seeing trends in the data. A many-to-one join was used to help in the creation of the Choropleth map where the boroughs were shaded by income.

I defined a Data Contract within .spec-kit/specification.md. This contract governed how disparate datasets (from the Census Bureau, NYC DEP, and NYC DOHMH) would be joined, standardized, and utilized. By utilizing the Gemini 3.1 Pro (low) agent against these specifications, it prevented “logic drift” and ensured that the data joins remained accurate throughout the development lifecycle.

As with all projects, I chose the platforms and software that were most relevant to my use case. For the data cleaning and transformations and initial set-up, I utilized RStudio. However, for the spec driven development, I switched to Antigravity. To keep my files and changes across both IDE’s, I simply used Git for version control. All code is written using R.

Please note that while this work is reproducible, you will need to create your own .Renviron file and utilize your own API keys (including a key for the OpenAI LLM).

Data Engineering

To ensure a performant user experience, I intentionally separated the data engineering pipeline from the UI logic:

The Engine (data_prep.qmd): All heavy lifting—including API fetching, temporal alignment, and spatial joining—was abstracted into a separate script. This generates a clean, pre-computed master_data.rds file.
The Dashboard (app.R): The Shiny application exclusively handles UI rendering, dynamic filtering, and interactive visualizations, keeping the dashboard snappy and responsive.

Data Lineage Table

Feature	Source	Key Field(s)	Transformation
Viral Load	NYC DEP (Socrata)	`test_result`	7-day rolling mean per borough
Hospitalizations	NYC DOHMH (GitHub)	`hosp_count`	Aggregated daily counts
Socioeconomics	US Census (ACS)	`B19013_001E`	Median Household Income (Borough level)
Spatial Assets	NYC Planning	`boro_name`	Renamed to Borough for spatial join

Iterations

The project underwent several strategic iterations:

Static Analysis: I initially focused on basic data validation and static time-series plots to confirm the 10-14 day lead time hypothesis.

Interactive Dashboarding: The static analysis evolved into a Shiny dashboard, introducing reactive components for dynamic date-range and borough-level filtering.

Dynamic LLM Insights Engine: Finally, I pivoted from a static analytical summary to a live, dynamic insights engine. Please note that the AI Insights engine is a very simple and limited use case of an LLM. I integrated OpenAI’s gpt-4o to dynamically summarize 14-day trends on the fly. To ensure observability and debuggability of the LLM responses, I implemented local JSON audit logging (logs/ai_traces.json), automatically tracking prompts, responses, and API latency. This was an alternative to using LangSmith due to this project being written exclusively in R.

Final Results

The resulting application offers a dual-faceted view of public health data:

Socioeconomic Context (Leaflet Map): The interactive choropleth map visualizes Median Household Income across NYC boroughs, immediately grounding the user in the socio-economic reality of the city.

Viral Trends (Dual-Axis Plotly Chart): The reactive Plotly chart successfully maps the viral load against hospitalizations. Users can clearly observe the peaks in wastewater signals preceding hospitalization spikes, validating the lead time hypothesis while offering granular, borough-level insights.

Citations

NYC OPEN DATA: SARS-CoV-2 concentrations measured in NYC Wastewater https://data.cityofnewyork.us/Health/SARS-CoV-2-concentrations-measured-in-NYC-Wastewat/f7dc-2q9f/about_data

NYC OPEN DATA: COVID-19 Daily Counts of Cases, Hospitalizations, and Deaths https://data.cityofnewyork.us/Health/COVID-19-Daily-Counts-of-Cases-Hospitalizations-an/rc75-m7u3/about_data

NYC OPEN DATA: Borough Boundaries (water areas included) https://data.cityofnewyork.us/City-Government/Borough-Boundaries-water-areas-included-/wh2p-dxnf/about_data

Tidycensus: R Package used for socioeconomic factors

Google DeepMind. (2026). Gemini 3.1 Pro (Low). [Large Language Model - Agent]. Antigravity IDE. Accessed May 9 - 10, 2026.

Google DeepMind. (2026). Gemini 3.1 Thinking. [Large Language Model]. https://gemini.google.com. Accessed May 6 - 10, 2026.