Final Project Data Preparation

AGOG 508: GeoAI For SmartCities

Author

Alex Mossawir

Research Questions

All models forecast new COVID-19 hospital admissions two weeks ahead at the week + sewershed level. Train/test split is 80/20 chronological — the most recent 20% of weeks in each model’s window are held out for evaluation.

Question 1. How much historical covid hospitalization data is needed to train a model with good covid hospital admissions prediction performance at the sewershed level?

Part 1a compares M1 and M2 within the short, early-pandemic window (September 2020 – September 2021), when admissions volumes were high and signal was strong. The goal is to identify the best training parameters via hyperparameter sweep.

M1 is the baseline: a pooled LSTM trained on hospital admissions data, with input_size = 4 (new admissions, current patients, ICU patients, and a missing mask to identify forward-filled weeks) and new admissions as the prediction target. Default hyperparameters.
M2 is the winner of a hyperparameter sweep over the following grid, trained on the same sewersheds and date range as M1:
- Sequence length: 4, 8, or 12 weeks
- Learning rate: 0.01, 0.001, 0.0001
- Hidden size: 16, 32, 64

Part 1b compares M2 (early window, 108 sewersheds) against M3.1.1 (late window, January 2023 – December 2024, same 108 sewersheds, M2’s chosen hyperparameters transferred without re-optimization). This isolates the effect of data window — does training on a longer, later, more stably-instrumented period improve forecasting?

Question 2. Does SARS-CoV-2 wastewater surveillance data improve sewershed-level COVID hospitalization forecasting beyond what hospital occupancy features already provide?

This question is addressed by comparing M3.1.2 and M3.2 on a matched set of 69 sewersheds — those with both ≥1 reporting hospital and ≥70% wastewater coverage during the late window.

M3.1.2 uses the same hospital-only feature set as M3.1.1 (input_size = 4), restricted to the 69 wastewater-eligible sewersheds. This restriction makes the comparison against M3.2 apples-to-apples.
M3.2 adds two wastewater channel — vp_conc_log_mean, the log-transformed mean SARS-CoV-2 concentration per sewershed-week, and an informational missing mask to indicate what was forward-filled — bringing input_size = 6. All other architecture and hyperparameters match M3.1.2.

The contrast between M3.1.2 and M3.2 isolates the marginal contribution of wastewater signal to forecasting accuracy.

Data Sources:

Hospital Electronic Response Data System (HERDS) Hospital Survey: COVID-19 Hospitalizations and Beds https://health.data.ny.gov/Health/New-York-State-Statewide-COVID-19-Hospitalizations/jw46-jpb7/about_data

New York State Statewide COVID-19 Wastewater Surveillance Data https://health.data.ny.gov/Health/New-York-State-Statewide-COVID-19-Wastewater-Surve/hdxs-icuh/about_data

New York State Sewershed Polygons github.com/nys-wwsn/ny-sewer-spatial-data

Latitude and Longitudes of Hospitals to Spatially Join to Sewershed Data https://health.data.ny.gov/Health/Health-Facility-General-Information/vn5v-hh5r/about_data

Part 0: Data Loading and Setup

0.1. Load Required Packages

Required packages for data cleaning: tidyverse
Required packages for presentation: knitr, kableExtra
Required packages for data retrieval: vroom

0.2 Import Run Mapping Information

Show code

## Date Ranges
window_start <- as.Date("2020-03-26")  # earliest hosp date
window_end   <- as.Date("2025-10-06")  # latest hosp date


#   Import Project Lookup
#--------------------------------------------------------------------------------------------------

lu_run <- read_excel("dictionaries/508_Final Project_Wastewater Monitoring_Lookup_AMossawir.xlsx")
##Print LU for Reference
lu_run %>%
kable(
    caption = "Project Lookup"
  )

Project Lookup
run_id	lu_year	lu_url	lu_dict	limit
hosp	hosp	https://health.data.ny.gov/api/views/jw46-jpb7/rows.csv?accessType=DOWNLOAD	dict_hosp	NA
waste	wwtp	https://health.data.ny.gov/api/views/hdxs-icuh/rows.csv?accessType=DOWNLOAD	dict_wwtp	NA

0.3. Generate Necessary Fields for Downloading CSV

New Columns: - Cache: Location of cached data - Cache_ts: Time that CSV was originally pulled from OData source

Show code

#   Cache Info and Location
#--------------------------------------------------------------------------------------------------
#     Workflow will check project file for cached data, to avoid re-downloading every time this
#     gets knit.
#--------------------------------------------------------------------------------------------------
## File cached data saved to

lu_run$cache <- file.path(paste0("data/raw/",lu_run$run_id,".rds"))

## File Timestamp saved to
lu_run$cache_ts <- file.path(paste0("data/raw/",lu_run$run_id,"_ts.txt"))

0.4. Import Data

0.4.1. Helper Function - import_fetch or load

This function checks if the document specified in the run mapping is currently cached within the project folder. If it is, it loads the data into the LU directly. If it is not, it pulls the number of rows specified in the document (or all) from the url in the lookup, caches the new file, and loads into the LU.

This function will be called on an entire row of the LU at a time - The following columns will always need to be in the LU for this to work: - cache - relative file path + name of .rds for each dataset imported - cache_ts - relative file path + name of .txt file with time that the url was queried - lu_url - odata URL where data is being pulled from

Show code

import_fetchorload <- function(LU){ 
  
  ## Check if the cached data exists 
if(file.exists(LU$cache[[1]])) {
  ## If so - print the following information:
  print(
  paste0(
    "Loading data from ", LU$cache[[1]],
    " (saved at ", read_lines(LU$cache_ts[[1]])[[1]], ")",
    " - saving to df."
        )
      )
  ## Read in data from cached location and ts
  LU$data [[1]]<- readRDS(LU$cache[[1]])
  LU$data_timestamp <- read_lines(LU$cache_ts[[1]])
  return(LU)
  
} else {
  ##Print the message
    print(
      paste0(
        "No data saved at ", as.character(LU$cache[[1]]), 
        "- loading from url: ",as.character(LU$lu_url[[1]])
          )
        )
  if (is.na(LU$limit[[1]])==TRUE) {
      temp <- as.data.frame(vroom(as.character(LU$lu_url[[1]])))
  } else {
      temp <- as.data.frame(vroom(as.character(LU$lu_url[[1]]), n_max = LU$limit[[1]]))}
  ts <- strftime(Sys.time())
  cat(ts, file = LU$cache_ts[[1]])
  saveRDS(temp,LU$cache[[1]])
  LU$data_timestamp <- ts
  LU$data[[1]] <- temp
  return(LU)
}}

0.4.2. Loop through LU to import data

Final output will be a series of messages in console indicating whether it needed to query the URL or whether it was able to just pull from the DF directly. Data frames will be stored within the LU table that we’ve been operating on.

Show code

#   Run through entire lookup, execute function
#--------------------------------------------------------------------------------------------------
#  
#--------------------------------------------------------------------------------------------------

lu_run <- rowwise(lu_run) %>% 
  mutate(import_fetchorload(cur_data()))

[1] "Loading data from data/raw/hosp.rds (saved at 2026-04-28 09:36:59) - saving to df."
[1] "Loading data from data/raw/waste.rds (saved at 2026-04-28 09:37:50) - saving to df."

0.5. Import all Dictionaries

Show code

#   Import dictionaries
#--------------------------------------------------------------------------------------------------
#   This can be used to add additional years or remove years more quickly. building this process 
#   on a subset of 20 rows from 2016 and 2017, and then revising to add in the full year range 
#   and the full datasets once workflows are clean, so it can be batched.
#   
#   This is designed so that each dictionary is sourced from a different tab in an excel document.
#   
#--------------------------------------------------------------------------------------------------

#--------------------------------------------------------------------------------------------------
## Dictionary Path:
dict <- "dictionaries/508_Final Project_dictionaries_AMossawir.xlsx"
#--------------------------------------------------------------------------------------------------
 
##pull years out of the run lu to loop over

year <- as.character(lu_run$lu_dict)
##create a dictionary list of sheets read from excel, looping over years
dicts <- lapply(year, function(row){
  cat(paste0("\nLoading dictionary from:", as.character(dict), "\nRead sheet: ", as.character(row)))
  read_excel(dict,sheet=row)

})


Loading dictionary from:dictionaries/508_Final Project_dictionaries_AMossawir.xlsx
Read sheet: dict_hosp
Loading dictionary from:dictionaries/508_Final Project_dictionaries_AMossawir.xlsx
Read sheet: dict_wwtp

Show code

## set the names of the objects using years
dicts <- setNames(dicts,year)

rm(year, dict)

Part 1. Data Cleaning and Processing

1.1. Column Rename and Type Classification

1.1.1. Helper Function - standardize_columns

Show code

#   Standardize Columns Function
#-------------------------------------------------------------------------------------------------
#   This is a function that standardizes the columns based on the provided dictionaries -
#   If this fails, it means that there is something that does not match in the dictionary.
#   
#   This is by design - I want it to be able to account for every single row, if something is
#   mapped incorrectly in the dataset, then it neeeds a new dictionary.
#
#   For example, while making this, I copy/pasted the rows from a print statement, but I
#   dropped one of the underscores on "__id" for the 2017, but not for the 2016 data. 
#
#--------------------------------------------------------------------------------------------------
 
standardize_columns <- function(run_lu, dictionaries){ 
    w_data <- run_lu$data[[1]]
    w_dict <- dictionaries[[ run_lu$lu_dict[[1]] ]]  
    
    w_names <- setNames(w_dict$api_field, w_dict$final_name)
    
    temp <- w_data %>%
        rename(!!!w_names) %>%
        mutate(
            across(
                all_of(w_dict$final_name[w_dict$final_type == "character"]),
                as.character
            ),
            across(
                all_of(w_dict$final_name[w_dict$final_type == "numeric"]),
                as.numeric
            ),
            across(
               all_of(w_dict$final_name[w_dict$final_type == "date"]),
                lubridate::mdy
               #~ as_date(substr(.x, 1, 10))
            ),
            across(
                all_of(w_dict$final_name[w_dict$final_type == "logical"]),
                as.logical
            )
        )
    run_lu$a_data[[1]] <- temp
    
    return(run_lu)
}

1.1.2. Loop through datasets to standardize

Show code

#   Run through entire lookup, execute function
#--------------------------------------------------------------------------------------------------
#   If this fails, check the console for issues with the LUs not matching.
#--------------------------------------------------------------------------------------------------    
lu_run <- rowwise(lu_run) %>% 
  mutate(standardize_columns(cur_data(),dicts))

Hospitals

v_date	v_pfi	v_facility	region	county	network	nyforwardregion	v_currentpatients	admissions	discharges_cum	fatalities_cum	beds_acute	beds_acute_occupied	beds_icu_1	beds_icu_occupied
2025-10-06	000727	OSWEGO HOSPITAL	CENTRAL NEW YORK REGIONAL OFFICE	OSWEGO	INDEPENDENT	CENTRAL NEW YORK	1	1	1101	82	55	42	8	6
2025-10-06	000812	GOUVERNEUR HOSPITAL	CENTRAL NEW YORK REGIONAL OFFICE	ST. LAWRENCE	ST. LAWRENCE HEALTH SYSTEM	NORTH COUNTRY	0	0	134	3	25	12	0	0
2025-10-06	000412	THE UNITY HOSPITAL OF ROCHESTER - ST. MARY'S CAMPUS	WESTERN REGIONAL OFFICE	MONROE	ROCHESTER REGIONAL HEALTH SYSTEM	FINGER LAKES	0	0	73	0	32	0	0	0

WWTPs

county	zip	v_totalpop	v_samplelocation	samplelocation_long	v_epaid	v_wwtp	CapacityMGD	StormwaterInput	SampleType	v_samplematrix	Pretreatment	SolidSeparation	ConcentrationMethod	ExtractionMethod	RecEffTargetName	RecEffSpikeMatrix	RecEffSpikeConc	PCRTarget	PCRGeneTarget	PCRGeneTargetRef	PCRType	LodRef	HumFracTargetMic	HumFracTargetMicRef	QuantStanType	StanRef	InhibitionMethod	NumNoTargetControl	v_date	v_flowrate	SampleID	LabID	v_testdate	v_conc	flag_belowlod	v_lodsewage	flag_ntcamplify	v_receffpct	InhibitionDetect	InhibitionAdjust	v_humfracmicconc	flag_qc
Albany	12158	31023	wwtp	NA	NY0025739	Town of Bethlehem - Cedar Hill WPCP	6	yes	24-hr time-weighted composite	raw wastewater	no	centrifugation	ultracentrifugation	zymo quick-rna fungal/bacterial miniprep #r2014	bcov vaccine	raw sample	1e+06	sars-cov-2	ip2 and ip4 combined	Chavarria-MirÃ³ G, Anfruns-Estrada E, MartÃnez-VelÃ¡zquez A, VÃ¡zquez-Portero M, Guix S, Paraira M, GalofrÃ© B, SÃ¡nchez G, PintÃ³ RM, Bosch A. Time Evolution of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) in Wastewater during the First Pandemic Wave of COVID-19 in the Metropolitan Area of Barcelona, Spain. Appl Environ Microbiol. 2021 Mar 11;87(7):e02750-20. doi: 10.1128/AEM.02750-20. PMID: 33483313; PMCID: PMC8091622.	qPCR	Corman VM, Landt O, Kaiser M, Molenkamp R, Meijer A, Chu DK, Bleicker T, BrÃ¼nink S, Schneider J, Schmidt ML, Mulders DG. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Eurosurveillance. 2020 Jan 23;25(3):2000045. Wilder ML, Middleton F, Larsen DA, Du Q, Fenty A, Zeng T, Insaf T, Kilaru P, Collins M, Kmush B, Green HC. Co-quantification of crAssphage increases confidence in wastewater-based epidemiology for SARS-CoV-2 in low prevalence areas. Water research X. 2021 May 1;11:100100.	crassphage	Stachler E, Kelty C, Sivaganesan M, Li X, Bibby K, Shanks OC. Quantitative CrAssphage PCR assays for human fecal pollution measurement. Environmental science & technology. 2017 Aug 15;51(16):9146-54.	DNA	Wilder ML, Middleton F, Larsen DA, Du Q, Fenty A, Zeng T, Insaf T, Kilaru P, Collins M, Kmush B, Green HC. Co-quantification of crAssphage increases confidence in wastewater-based epidemiology for SARS-CoV-2 in low prevalence areas. Water research X. 2021 May 1;11:100100.	Green HC, Field KG. Sensitive detection of sample interference in environmental qPCR. Water research. 2012 Jun 15;46(10):3251-60.	3	2024-10-15	2.45	20241015NY001025739A	Quadrant	2024-10-16	0	NA	800	FALSE	0.94	no	no	59200000	no
Albany	12158	31023	wwtp	NA	NY0025739	Town of Bethlehem - Cedar Hill WPCP	6	yes	24-hr time-weighted composite	raw wastewater	no	centrifugation	ultracentrifugation	zymo quick-rna fungal/bacterial miniprep #r2014	bcov vaccine	raw sample	1e+06	sars-cov-2	ip2 and ip4 combined	Chavarria-MirÃ³ G, Anfruns-Estrada E, MartÃnez-VelÃ¡zquez A, VÃ¡zquez-Portero M, Guix S, Paraira M, GalofrÃ© B, SÃ¡nchez G, PintÃ³ RM, Bosch A. Time Evolution of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) in Wastewater during the First Pandemic Wave of COVID-19 in the Metropolitan Area of Barcelona, Spain. Appl Environ Microbiol. 2021 Mar 11;87(7):e02750-20. doi: 10.1128/AEM.02750-20. PMID: 33483313; PMCID: PMC8091622.	qPCR	Corman VM, Landt O, Kaiser M, Molenkamp R, Meijer A, Chu DK, Bleicker T, BrÃ¼nink S, Schneider J, Schmidt ML, Mulders DG. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Eurosurveillance. 2020 Jan 23;25(3):2000045. Wilder ML, Middleton F, Larsen DA, Du Q, Fenty A, Zeng T, Insaf T, Kilaru P, Collins M, Kmush B, Green HC. Co-quantification of crAssphage increases confidence in wastewater-based epidemiology for SARS-CoV-2 in low prevalence areas. Water research X. 2021 May 1;11:100100.	crassphage	Stachler E, Kelty C, Sivaganesan M, Li X, Bibby K, Shanks OC. Quantitative CrAssphage PCR assays for human fecal pollution measurement. Environmental science & technology. 2017 Aug 15;51(16):9146-54.	DNA	Wilder ML, Middleton F, Larsen DA, Du Q, Fenty A, Zeng T, Insaf T, Kilaru P, Collins M, Kmush B, Green HC. Co-quantification of crAssphage increases confidence in wastewater-based epidemiology for SARS-CoV-2 in low prevalence areas. Water research X. 2021 May 1;11:100100.	Green HC, Field KG. Sensitive detection of sample interference in environmental qPCR. Water research. 2012 Jun 15;46(10):3251-60.	3	2023-01-24	5.33	20230124NY001025739A	Quadrant	2023-01-25	101000	NA	800	FALSE	3.45	no	no	45800000	no
Albany	12204	109426	wwtp	NA	NY0026875	Albany County WPD - North Plant	35	yes	24-hr time-weighted composite	raw wastewater	no	centrifugation	ultracentrifugation	zymo quick-rna fungal/bacterial miniprep #r2014	bcov vaccine	raw sample	1e+06	sars-cov-2	ip2 and ip4 combined	Chavarria-MirÃ³ G, Anfruns-Estrada E, MartÃnez-VelÃ¡zquez A, VÃ¡zquez-Portero M, Guix S, Paraira M, GalofrÃ© B, SÃ¡nchez G, PintÃ³ RM, Bosch A. Time Evolution of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) in Wastewater during the First Pandemic Wave of COVID-19 in the Metropolitan Area of Barcelona, Spain. Appl Environ Microbiol. 2021 Mar 11;87(7):e02750-20. doi: 10.1128/AEM.02750-20. PMID: 33483313; PMCID: PMC8091622.	qPCR	Corman VM, Landt O, Kaiser M, Molenkamp R, Meijer A, Chu DK, Bleicker T, BrÃ¼nink S, Schneider J, Schmidt ML, Mulders DG. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Eurosurveillance. 2020 Jan 23;25(3):2000045. Wilder ML, Middleton F, Larsen DA, Du Q, Fenty A, Zeng T, Insaf T, Kilaru P, Collins M, Kmush B, Green HC. Co-quantification of crAssphage increases confidence in wastewater-based epidemiology for SARS-CoV-2 in low prevalence areas. Water research X. 2021 May 1;11:100100.	crassphage	Stachler E, Kelty C, Sivaganesan M, Li X, Bibby K, Shanks OC. Quantitative CrAssphage PCR assays for human fecal pollution measurement. Environmental science & technology. 2017 Aug 15;51(16):9146-54.	DNA	Wilder ML, Middleton F, Larsen DA, Du Q, Fenty A, Zeng T, Insaf T, Kilaru P, Collins M, Kmush B, Green HC. Co-quantification of crAssphage increases confidence in wastewater-based epidemiology for SARS-CoV-2 in low prevalence areas. Water research X. 2021 May 1;11:100100.	Green HC, Field KG. Sensitive detection of sample interference in environmental qPCR. Water research. 2012 Jun 15;46(10):3251-60.	3	2023-06-05	20.00	20230605NY001026875A	Quadrant	2023-06-06	0	NA	800	FALSE	14.51	no	no	8050000	no

1.2. Relevant Columns Only

Show code

prefixes <- c("v_", "flag_")
colfilter <- function(w_data, prefix_list) {
  w_data <- w_data |>
    select(starts_with(prefix_list))
}

lu_run$a_data <- map(lu_run$a_data, ~ colfilter(.x, prefixes))

rm(prefixes)

Part 2: Data Subsetting on Exclusion Criteria

2.1 Data Quality Rule Import

Based on the contents of the data dictionaries and the key variables being used for the analysis (permanent facility id), I have created a rule dictionary with instructions to be placed into the rule engine.

This first pass is to run through the basic data quality filters - not study design filters. Any observations that cannot be used for this analysis (e.g. enhanced de-identification redacting facility information, missing key case-mix identifier like age group) will be filtered through this stage. The engine will count the number of rows in each run’s dataset and log it, as it passes the data onto the next rule.

All rule dictionaries will have the same structure, so they can be fed through the same update/logging engine downstream, this pass uses the lu filters rule dictionary.

Dictionaries have the following structure:

stage: This indicates which rules should be applied - I am likely keeping these dictionaries in separate tabs, but I wanted to make this filter in case I stack it later.
rule_id: Unique identifier for the rules for QA/querying
rule_name: Unique legible identifier for QA
rule condition: the logic that will be used to identify which rows the rules are applied to.
rule_outcome: For filtering, all rules apply exclude, but downstream and for sensitivity analysis, the rule engine may be used for partitioning the dataset or other outcomes - this will be used to determine that process.
applies: List of which runs to apply the rule to - for example, the 2016 and 2017 datasets have the abortion_edit_indicator for enhanced de-identification, while the 2018 and 2019 datasets do not have this indicator, and identifying redactions must be done by searching for “Redacted for Confidentiality” within facility_name.
order: Rules are applied sequentially based on this column -logged_as This is the clean string used to flag what the transformation was for the log table, to be formatted for final tables and outputs.

Show code

#   Import LU
#--------------------------------------------------------------------------------------------------
#   Set the stage here with the variable stage - this step is currently the data_quality pass, so
#   it's going to pull in the rules flagged with "data_quality" - in this case, from tab name.
#   Store it to data quality, again. If we want to import multiple LUs  - add tab names to 
#   the list.
#--------------------------------------------------------------------------------------------------

#--------------------------------------------------------------------------------------------------
## Dictionary Path:
dict <- "dictionaries/508_Final Project_filters_AMossawir.xlsx"
stage <- c("data_quality")
#--------------------------------------------------------------------------------------------------
 
##create a dictionary list of sheets read from excel, loop over stages, and add them to dicts.
temp_dicts <- lapply(stage, function(row){
  cat(paste0("\nLoading dictionary from:", as.character(dict), "\nRead sheet: ", as.character(row)))
  read_excel(dict)
})


Loading dictionary from:dictionaries/508_Final Project_filters_AMossawir.xlsx
Read sheet: data_quality

Show code

temp_dicts <- setNames(temp_dicts, paste0("lu_", stage))

dicts[names(temp_dicts)] <- NULL
dicts <- c(dicts, temp_dicts)
## drop temp table
rm(temp_dicts, dict)

##Print LU for Reference
dicts$lu_data_quality %>%
kable(
    caption = "Rules Lookup for Filtering"
  )

Rules Lookup for Filtering
stage	rule_id	rule_name	rule_condition	rule_outcome	output_prefix	applies	order	logged_as
data_quality	1	missing_pfi	is.na(v_pfi)	exclude	a	hosp	1	Missing Facility ID
data_quality	2	missing_outcome	is.na(v_admissions_totalnew)	exclude	a	hosp	2	Missing Admissions
data_quality	3	outside_window	v_date < window_start \| v_date > window_end	exclude	a	hosp	3	Outside Analysis Window
data_quality	4	missing_epaid	is.na(v_epaid)	exclude	a	waste	1	Missing Facility ID
data_quality	5	wwtp_only	v_samplelocation != “wwtp”	exclude	a	waste	2	Exclude non-WWTP samples
data_quality	6	influent_only	v_samplematrix != “raw wastewater”	exclude	a	waste	3	Exclude non-Influent samples
data_quality	7	qc_failed	flag_qc != “no”	exclude	a	waste	4	Sample failed quality control
data_quality	8	no_template_control_amplify	flag_ntcamplify == TRUE	exclude	a	waste	5	Sample failed quality control
data_quality	9	missing_concentration	is.na(v_conc) & (is.na(flag_belowlod) \| flag_belowlod == FALSE)	exclude	a	waste	6	Missing Concentration
data_quality	10	outside_window	v_date < window_start \| v_date > window_end	exclude	a	waste	7	Outside Analysis Window

2.2. Data Quality Rules Pass - Helper Functions

The structure of the engine will first loop through over run_id using lu_run, this is the exterior loop. For each run, the engine will first determine which rules apply applies.

Then it will loop over the rules in an interior loop. For each rule, it will log iniitial nrow, and save to audit log. Then it will either apply the rule filtering or skip over it. It will output final nrows to the audit log, and pass the revised data to the next step of the interior loop. It proceeds sequentially over the rows, so working data will be updated once the rule has been executed and logged.

The inner loop passes the final audit lists and the final w_data back to the outer loop. The outer loop saves the audit log and the data to columns in the lookup.

2.2.1. Cached Locations

Show code

#   Cache Info and Location
#--------------------------------------------------------------------------------------------------
#     Workflow will check project file for cached data - this is write-only, just to keep records.
#--------------------------------------------------------------------------------------------------
## File cached data saved to

lu_run[[paste0("cache_", stage)]]<- file.path(paste0("data/",stage,"/",lu_run$run_id,".rds"))
print(lu_run[[paste0("cache_", stage)]])

[1] "data/data_quality/hosp.rds"  "data/data_quality/waste.rds"

Show code

## File Timestamp saved to
lu_run[[paste0("cache_ts_", stage)]]<- file.path(paste0("data/",stage,"/",lu_run$run_id,"_ts.txt"))
print(lu_run[[paste0("cache_ts_", stage)]])

[1] "data/data_quality/hosp_ts.txt"  "data/data_quality/waste_ts.txt"

2.2.2. flag_rule_application

This function is called at a run_id level within the outer loop. Once the working lookup is selected, it loops over the rule list to determine if the rule is applicable to this particular run. i.e: a rule with the applies text “puf_2016, puf_2017” will be split into a vector and evaluated - it will only be applied when the run_id is in that vector.

Show code

#Function to Determine Rule Applicability - Within Outer Loop
#--------------------------------------------------------------------------------------------------
#   Within outer loop, looking at the current run_id that is being passed, 
#    and determine which rules should be applied.
#--------------------------------------------------------------------------------------------------

flag_rule_application <- function(rule, run_id) {
  applies_vec <- rule$applies_rule
  if ("all" %in% applies_vec) { TRUE } else if (run_id %in% applies_vec) { TRUE } else { FALSE }
}

2.2.3. apply_exclude

This function is called at a rule level within the inner loop. Because the lookup stores the logic for what columns should be excluded as a string, it needs to be unquoted to be properly evaluated as logic. This applies that logic with a filter function from dplyr in order to exclude anything that fits that criteria.

Show code

#Function Within Inner Loop - Actually Exclude
#--------------------------------------------------------------------------------------------------
#   Applies filter on the dataset using rule expression - called within the inner loop
#--------------------------------------------------------------------------------------------------

##function to if it excludes, does it
apply_exclude <- function(w_data, rule_expr) {
  w_data %>% filter(!(!!rule_expr))
}

2.2.4. apply_rules

This function is called on each specific rule list within the inner loop. It takes a single rule from the inner loop and applies it to the w_data working data set. It also generates a list of values to output to an audit table so all transformations (or skipped rules) can be tracked. This audit row lists the initial number of rows, and the final number of rows after transformation, and reports the number of observations affected by each rule.

This is the core of the inner loop. It returns transformed w_data and a row of the audit table.

Show code

#--------------------------------------------------------------------------------------------------
#   Function to Apply Rules - Within Inner Loop
#--------------------------------------------------------------------------------------------------
#   This takes the working dictionary and working data from the interior loop, and continues,
#   passing run_id from the outer loop. It always references [[1]] because it will be passed
#   rowwise. Outputs the updated dataset and a new list, audit_log.
#--------------------------------------------------------------------------------------------------
apply_rule <- function(rule, w_data, run_id) {
##Prelim Information  
  run_flag <- rule$flag_rule_applies
#Generate Audit Row
audit_row <- list(
  stage         = rule$stage,
  run_id        = run_id,
  rule_id       = rule$rule_id,
  rule_name     = rule$rule_name,
  rule_condition= rule$rule_condition,
  rule_outcome  = rule$rule_outcome,
  n_initial     = nrow(w_data),
  logged_as     = rule$logged_as,
  skipped       = !run_flag,
  applied       = run_flag
)
##     cat ( paste0("Run Id: ", string(run_id),   " Rule Name ",rule$rule_name), " Run flag: ", run_flag, " Class W_Data: ",  class(w_data))
## Apply dplyr rules based on what rules text is - this can be updated when I add others. But.
if (run_flag == FALSE) {
  # do nothing
} else if (rule$rule_outcome == "exclude") {
  rule_expr <- parse_expr(rule$rule_condition)
  w_data <- apply_exclude(w_data, rule_expr)
}
audit_row$n_final  <- nrow(w_data)
audit_row$n_change <- audit_row$n_initial - audit_row$n_final

audit_row$timestamp <- Sys.time()
  
   return(list(w_data = w_data, 
              audit_row = audit_row))
         
}

2.2.5. outer_pass

This function is the core of the outer loop. It is passed a single row from lu_run, and uses that information to call the relevant objects and manage the inner loop.

While it only takes a row from lu_run as a direct argument, it references the following global variables to preserve clarity and naming:

stage: This is used to specify throughout that we are on the data quality filtering step. This allows future steps to be built off the same engine.
stage_prefix: This is used to specify the output column name’s prefix - in this case, it is ‘a2’
prev_col: This is used to specify the source column name - in this case it is ‘a_data’.

Show code

#Function Within Outer Loop - Manage Inner Loop
#--------------------------------------------------------------------------------------------------
#   This is called by the outer loop and is used to manage the inner loop. It passes both
#   transformed data and the audit log back.
#--------------------------------------------------------------------------------------------------

outer_pass <- function(lu_row) {
      
  ##Get that run_id, data, dict to keep passing through  
    run_id <- lu_row$run_id
  # the whole reason this loop wasn't working is because this a_data is apparently. for some reason. a list length 1
    ## with a tibble at that first place. 
    w_data <- lu_row[[prev_col]][[1]]
    w_dict <- dicts[[paste0("lu_",stage)]]
    
    ## Convert rules from a dataframe into list of lists
    rules <- apply(w_dict, 1, as.list)

    for (j in seq_along(rules)) {
      rule <- rules[[j]]
      ##Applies is currently a vector hand-written in excel
        ##First remove all whitespace, then split out what remains
              ## It's gonna put the list into a list of length one with a vector in it.
              #you literally just want the vector.
      rule$applies_rule <- strsplit(gsub("\\s", "",rule$applies), ",")[[1]]
      
      ##We're looping through all the rules rn
      rule$flag_rule_applies <- flag_rule_application(rule, run_id)
      ## and saving the new items back into rules   
       rules[[j]] <- rule
          }

  ## Create an empty list object so that we can store the audit as we go.
    audit <- list()
  ##i did this just counting first but seq_along actually passes info along each
    for (j in seq_along(rules)) {
    ##pull out list j from dictionary
      rule <- rules[[j]]
##Call function apply_rule
##    Apply rule input:
##        rule = one single row (but as list) of dictionary
##        w_data = working dataset
##        run_id = the same for everything in this inner loop
##    Apply rule output:
##        a list of: w_data, audit_row    
      result <- apply_rule(rule, w_data, run_id)
      ## replace w_data with the new w_data saved in result returned list
      w_data <- result$w_data
      ##store 
      audit[[j]] <- result$audit_row
  
      
      }
    list(
    data = w_data,
    audit = audit
  )

}

2.3. Execute Function Calls to Clean Data

The functions above will refer to global stage, prev_col, stage_prefix variables. This is to allow for flexibility of the pipeline. prev_col indicates where the data for this stage should be sourced from, and stage_prefix indicates what the column name of the destination should be. I will revise these away from using the global variable approach when I have more time.

Show code

#--------------------------------------------------------------------------------------------------
## Stage Variable
stage <- "data_quality"
stage_prefix <- "a2_"

## Column Data Source
prev_col <- "a_data"


#--------------------------------------------------------------------------------------------------

newcolname <- paste0(stage_prefix, "data")

#--------------------------------------------------------------------------------------------------
results <- vector("list", nrow(lu_run))

#--------------------------------------------------------------------------------------------------

for (i in seq_len(nrow(lu_run))) {
  ##Actually run the nested loops here, saving to a new vector
  results[[i]] <- outer_pass(lu_run[i, ])
  
  ## Save the outputted day in the specified new column
  lu_run[[newcolname]][[i]] <- results[[i]]$data
  ##save audit log to a table
  lu_run$audit[[i]] <- bind_rows(results[[i]]$audit)
  ## Cache this
  saveRDS(results[[i]]$data,lu_run$cache_data_quality[[i]])
  ##Make a timestamp
  ts <- strftime(Sys.time())
  
  ##save the timestamp
  cat (ts, file = lu_run$cache_ts_data_quality[[i]])
}

rm(results, i , stage, stage_prefix, newcolname, ts, prev_col)

2.4. Output final analytical sample

Analytical Sample Size
	Hospitalizations	Wastewater
Initial Sample Size	326,349	33,970
Missing Facility ID	0	0
Missing Admissions	0	0
Outside Analysis Window	0	0
Exclude non-WWTP samples	0	537
Exclude non-Influent samples	0	1,010
Sample failed quality control	0	4,283
Missing Concentration	0	0
Final Analytical Sample	326,349	28,140

Part 3. Recoding And Restructuring

3.1. Import Lookups

3.1.1. Hospital Address Lookup

General Facility Information data sourced from DOH at this link: https://health.data.ny.gov/Health/Health-Facility-General-Information/vn5v-hh5r/about_data

3.1.2. Sewershed Shapefiles and Metadata

Shapefiles and metadata sewershed sourced from this link: https://github.com/nys-wwsn/ny-sewer-spatial-data

Show code

##==================================================================================================
## SEWERSHED LOOKUPS
##--------------------------------------------------------------------------------------------------
## Builds three lookup tables used downstream by the recoding pipeline:
##   - lu_facilities : PFI → name, lat, long for every NYS hospital with geocoding
##   - lu_sewer      : one row per WWTP sewershed (epaid keyed) with polygon geometry  
##   - lu_pfi        : analysis-set hospitals spatially joined to their sewershed
##
## The spatial join — assigning each hospital point to the sewershed polygon it falls in — is
## the GeoAI core of the project. Wastewater concentrations are measured at the WWTP, so to pair
## a community signal with hospital admissions we need to know which sewershed each hospital
## drains to.
##==================================================================================================

##--------------------------------------------------------------------------------------------------
## 1. Hospital facility lookup (PFI → coordinates)
##--------------------------------------------------------------------------------------------------
## Source: NYS Health Facility General Information, snapshotted 2026-04-29.
## Provides authoritative lat/long for currently-operating facilities. PFIs are zero-padded to
## width 6 to match the format used in the hospitalizations data.
temp_pfi <-  vroom(file.path("dictionaries/Health_Facility_General_Information_20260429.csv"))
temp_pfi <- temp_pfi |>
  select(v_pfi =`Facility ID`,
         name = `Facility Name`,
         lat = `Facility Latitude`,
         long = `Facility Longitude`) |>
  mutate(v_pfi = formatC(v_pfi, width = 6, flag = "0", format = "d"))

##--------------------------------------------------------------------------------------------------
## 2. Sewershed metadata + polygon geometry
##--------------------------------------------------------------------------------------------------
## Two files from the nys-wwsn/ny-sewer-spatial-data GitHub repo:
##   - nys-wws-sewersheds.csv         : metadata (epaid, wwtp_name, population_served, sample 
##                                      location and type, etc.)
##   - nys-wws-sewersheds-polygons.csv: WKT polygon geometries
## Join key is implicit (cdc_id / sw_id pair); left_join() resolves it.
temp_sewer <- vroom(file.path("dictionaries/nys-wws-sewersheds.csv"))

## Polygon file ships without a CRS embedded — the repo README specifies WGS84, so we set 
## EPSG:4326 explicitly. Bbox sanity-checked against NY's footprint (xmin ≈ -79.7, ymax ≈ 45.0).
temp_shed <- st_read(file.path("dictionaries/nys-wws-sewersheds-polygons.csv"))

Reading layer `nys-wws-sewersheds-polygons' from data source 
  `C:\Users\alexm\OneDrive\Desktop\AGOG\Final Project\GeoAI Final Project\dictionaries\nys-wws-sewersheds-polygons.csv' 
  using driver `CSV'
Simple feature collection with 670 features and 3 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -79.72533 ymin: 40.49584 xmax: -72.29054 ymax: 45.01053
CRS:           NA

Show code

temp_shed = st_set_crs(temp_shed, 4326)

## Attach metadata to polygons. st_make_valid() handles the inevitable self-intersection edges
## that S2 (sf's default geometry engine on lat/lon) refuses to process.
temp_sewer <-  left_join(temp_shed, temp_sewer)
temp_sewer <- st_make_valid(temp_sewer)

##--------------------------------------------------------------------------------------------------
## 2a. Filter to WWTP-level samples and dedupe
##--------------------------------------------------------------------------------------------------
## NY's wastewater network samples at multiple points per plant: "wwtp" (influent at the
## treatment plant — the catchment-wide signal we want), and "upstream" (sub-catchment manhole
## sampling, useful but represents a smaller nested polygon). We keep only "wwtp" because our
## wastewater concentration data corresponds to plant-level measurements.
##
## After filtering there's still 2x duplication of every row from a quirk in how the metadata
## file is structured — distinct(epaid) collapses to one row per sewershed.
##
## Then a chain of geometry sanitization to deal with self-overlapping multipolygons:
##   st_make_valid()                                  - fix invalid edges
##   st_union(geometry, by_feature = TRUE)            - dissolve internal overlaps within each
##                                                      multipolygon (some sewersheds have
##                                                      MULTIPOLYGONs whose constituent parts
##                                                      overlap each other)
##   st_make_valid()                                  - re-validate (st_union can produce
##                                                      invalid output on gnarly inputs)
temp_sewer <- temp_sewer |>
  filter(sample_location == "wwtp") |>
  distinct(epaid, .keep_all = TRUE) |>
  st_make_valid() |>
  mutate(geometry = st_union(geometry, by_feature = TRUE)) |>
  st_make_valid()

## Belt-and-suspenders: a few polygons still had internal self-overlap that survived the above.
## st_buffer(0) is the classic GIS folk remedy — forces GEOS to rebuild the geometry from
## scratch, which usually resolves topology issues that st_make_valid alone won't catch. Using
## planar GEOS (sf_use_s2(FALSE)) for this op because S2 doesn't support zero-buffer; switching
## back to S2 immediately after.
sf_use_s2(FALSE)
temp_sewer <- temp_sewer |>
  st_buffer(0) |>
  st_make_valid()
sf_use_s2(TRUE)

##--------------------------------------------------------------------------------------------------
## 3. Hand-coded facility additions for closed/merged hospitals
##--------------------------------------------------------------------------------------------------
## Eight hospitals are absent from the current Health Facility General Information dataset
## because they've closed, merged, or been absorbed since the last refresh — but they appear in
## the historical hospitalizations data and represent stable community catchments. Coordinates
## sourced manually from Google Maps; addresses verified against AHA / Healthgrades records.
##
## (Seven temporary COVID-19 surge facilities — PFIs 080001–080007, including Javits Center and
## USNS Comfort — are excluded upstream as non-catchment overflow facilities, not added here.)
manual_facilities <- tribble(
  ~v_pfi,    ~name,                                                            ~lat,             ~long,              ~v_address,
  "000325",  "THE UNIVERSITY OF VERMONT HEALTH NETWORK - ALICE HYDE MEDICAL CENTER", 44.85696878568751,  -74.29148442883366,  "133 Park St, Malone, NY 12953",
  "000565",  "EASTERN NIAGARA HOSPITAL - LOCKPORT DIVISION",                         43.17766479149377,  -78.67168996117513,  "521 East Ave, Lockport, NY 14094",
  "000798",  "CLAXTON-HEPBURN MEDICAL CENTER",                                       44.69286484859764,  -75.50025832557549,  "214 King St, Ogdensburg, NY 13669",
  "001153",  "WYOMING COUNTY COMMUNITY HOSPITAL",                                    42.75436489510769,  -78.13098270352755,  "400 N Main St, Warsaw, NY 14569",
  "001439",  "MOUNT SINAI BETH ISRAEL",                                              40.73865113151168,  -73.98243423411715,  "281 First Ave, New York, NY 10003",
  "000413",  "STRONG MEMORIAL HOSPITAL",                                       43.12161,         -77.62862,          "601 Elmwood Ave, Rochester, NY 14642",
  "000583",  "MOUNT ST MARYS HOSPITAL AND HEALTH CENTER",                      43.17270,         -79.03670,          "5300 Military Rd, Lewiston, NY 14092",
  "001464",  "NEW YORK-PRESBYTERIAN HOSPITAL - COLUMBIA PRESBYTERIAN CENTER",  40.84072,         -73.94192,          "622 W 168th St, New York, NY 10032"
)

##--------------------------------------------------------------------------------------------------
## 4. Restrict to the analysis universe and project to sf
##--------------------------------------------------------------------------------------------------
## We only need facility geocoding for PFIs that actually appear in the hospitalizations data —
## ~200 of NY's ~5,000 listed facilities. Filtering early keeps the spatial join fast and 
## reduces noise in the duplicate audit.
pfis_in_data <- lu_run$a2_data[[which(lu_run$run_id == "hosp")]] |> distinct(v_pfi) |> pull(v_pfi)

## Combine the dataset's geocoded facilities with the manual additions, then filter to the
## analysis universe. Drop facilities with missing longitude (rare, but they'd break st_as_sf).
lu_pfi_complete <- bind_rows(filter(temp_pfi, !is.na(temp_pfi$long)), manual_facilities) |>
  filter(v_pfi %in% pfis_in_data)

## Promote to an sf points object in WGS84.
lu_pfi_complete <- st_as_sf(lu_pfi_complete, coords = c("long", "lat"), crs = 4326)

##--------------------------------------------------------------------------------------------------
## 5. Spatial join: assign each hospital to its sewershed
##--------------------------------------------------------------------------------------------------
## Standard point-in-polygon. left = TRUE so unmatched hospitals (rural facilities outside any
## monitored sewershed — typically on septic systems, not connected to municipal sewer) survive
## the join with NA sewershed fields rather than being silently dropped. They become a 
## limitations-section bullet rather than vanishing.
lu_pfi_complete<- st_join(
  lu_pfi_complete,
  temp_sewer,
  join = st_intersects,
  left = TRUE
)

## A handful of sewershed multipolygons have internal self-overlap that survived every geometry
## sanitization above (st_make_valid, st_union, st_buffer(0)) — the geometries are gnarly enough
## that st_intersects() returns the same epaid twice for a single hospital point. The duplicate
## rows carry identical sewershed assignment, so collapsing post-join loses no information.
lu_pfi_complete <- lu_pfi_complete |>
  distinct(v_pfi, .keep_all = TRUE)

##--------------------------------------------------------------------------------------------------
## 6. Stash into the dicts list and clean up
##--------------------------------------------------------------------------------------------------
temp_dicts <- list(lu_sewer = temp_sewer, lu_facilities = temp_pfi, lu_pfi = lu_pfi_complete)
dicts[names(temp_dicts)] <- NULL
dicts <- c(dicts, temp_dicts)

## drop temp tables
rm(temp_dicts, temp_pfi, temp_sewer, temp_shed, lu_pfi_complete)

3.2. Recode WWTP Data

Show code

ww <- lu_run$a2_data[[which(lu_run$run_id == "waste")]]
lu_sewer <- dicts$lu_sewer

ww <- ww |>
  mutate(
    flag_belowlod_derived = (v_conc == 0) | (v_conc < v_lodsewage),
    vp_conc               = if_else(flag_belowlod_derived, v_lodsewage / 2, v_conc),
    vp_conc_log           = log1p(vp_conc),
  ) |>
  select(v_epaid, v_date, vp_conc, vp_conc_log, flag_belowlod_derived,
         v_totalpop, v_wwtp)


sg <- nrow(ww)
ww <- left_join(ww, lu_sewer, by = c("v_epaid" = "epaid"))
if (!(nrow(ww) == sg)) {
  stop("join duping")
}
n_unmatched <- sum(is.na(ww$WKT))
if (n_unmatched > 0) {
  warning(sprintf("%d WW rows had no sewershed match", n_unmatched))
}

map_ww <- ww |>
  st_drop_geometry() |>
  count(v_epaid, WKT) |>
  st_as_sf(wkt = "WKT", crs = 4326)

lu_run$a3_data[[which(lu_run$run_id == "waste")]] <- ww

rm(ww, lu_sewer)

3.2.2 Map of Log-Transformed Sample Count by Sewesehd

Show code

map_ny_counties = counties(state = "NY", cb = TRUE, progress_bar = FALSE)

ggplot() +
  geom_sf(data = map_ny_counties, fill = "white", color = "grey") +
  geom_sf(data = map_ww, aes(fill = n), linewidth = 0.2,, alpha = 0.8) +
  scale_fill_viridis_c(trans = "log10") +
  theme_classic() +
  labs(title = "Sample count (log-transformed) by sewershed") + 
       theme(axis.text = element_blank(), axis.ticks = element_blank(), panel.grid = element_blank())

3.3. Recode Hospital Data

Show code

## Grab Hospital Location from Dicts

lu_pfi <- dicts$lu_pfi
lu_sewer <- dicts$lu_sewer


hosp <- lu_run$a2_data[[which(lu_run$run_id == "hosp")]]
hosp <-  left_join(hosp, lu_pfi)

qa_excluded_hospitals <- hosp |>
  filter(is.na(epaid)) 


map_excluded_sf <- hosp |>
  filter(is.na(epaid)) |>
  ## Also filtering out surge centers
  filter(!str_starts(v_pfi, "080")) |>
  distinct(v_pfi, v_facility, geometry) |>
  st_as_sf()

hosp <- hosp |>
  filter(!is.na(epaid)) |>
  filter(!str_starts(v_pfi, "080"))

lu_run$a3_data[[which(lu_run$run_id == "hosp")]] <- hosp 


rm(hosp, lu_pfi)

3.3.1 Table of all Hospitals Excluded from Analysis

Show code

### List of Excluded Hospitals
qa_excluded_hospitals <- qa_excluded_hospitals |> 
  group_by(v_pfi, v_facility) |>
  summarize(
    n = n(),
    v_currentpatients = sum(v_currentpatients),
    v_patients_icu = sum(v_patients_icu),
    v_admissions_totalnew = sum(v_admissions_totalnew),
    min = min(v_date),
    max = max(v_date))

qa_excluded_hospitals|>
  kable()

v_pfi	v_facility	n	v_currentpatients	v_patients_icu	v_admissions_totalnew	min	max
000303	THE UNIVERSITY OF VERMONT HEALTH NETWORK - ELIZABETHTOWN COMMUNITY HOSPITAL	1630	1524	0	241	2020-03-26	2025-10-06
000541	NORTH SHORE UNIVERSITY HOSPITAL	1630	95218	16697	8799	2020-03-26	2025-10-06
000817	CLIFTON-FINE HOSPITAL	1630	241	0	45	2020-03-26	2025-10-06
000873	IRA DAVENPORT MEMORIAL HOSPITAL	1630	576	0	152	2020-03-26	2025-10-06
000885	LONG ISLAND COMMUNITY HOSPITAL	1630	25049	5730	3653	2020-03-26	2025-10-06
000889	UNIVERSITY HOSPITAL - STONY BROOK SOUTHAMPTON HOSPITAL	1630	9418	1976	1548	2020-03-26	2025-10-06
000896	ST. CHARLES HOSPITAL	1630	13159	3215	2306	2020-03-26	2025-10-06
000943	ST CATHERINE OF SIENA HOSPITAL	1630	25153	4122	4001	2020-03-26	2025-10-06
000968	CATSKILL REGIONAL MEDICAL CENTER - G. HERMANN SITE	1630	139	0	31	2020-03-26	2025-10-06
001002	ELLENVILLE REGIONAL HOSPITAL	1630	2742	0	336	2020-03-26	2025-10-06
003067	MILLARD FILLMORE SUBURBAN HOSPITAL	1630	32373	4931	5462	2020-03-26	2025-10-06
080001	JAVITS CENTER HOSPITAL	1624	4877	12	0	2020-04-01	2025-10-06
080002	USNS COMFORT HOSPITAL	25	516	0	0	2020-04-03	2020-04-27
080003	MOUNT SINAI - SAMARITANS PURSE	1609	499	12	0	2020-04-16	2025-10-06
080004	NYC H+H BILLIE JEAN KING TENNIS CENTER	1609	446	0	0	2020-04-16	2025-10-06
080005	NYC H+H ROOSEVELT ISLAND MEDICAL CENTER	1608	2502	0	0	2020-04-17	2025-10-06
080006	MAIMONIDES CROWN HEIGHTS	1608	418	0	0	2020-04-17	2025-10-06
080007	NYP - RYAN LARKIN FIELD HOSPITAL	1604	451	0	31	2020-04-21	2025-10-06

3.3.2 Map of Hospitals Outside Sewersheds

Show code

map_lu_sewer <- dicts$lu_sewer
ggplot() +
  geom_sf(data = map_ny_counties, fill = "white", color = "grey")+
  geom_sf(data = map_lu_sewer, fill = "lightblue", color = "white", linewidth = 0.2, alpha = 0.8) +
  geom_sf(data = map_excluded_sf, color = "red", size = 2, alpha = 0.7) +
  theme_classic() +
  labs(title = "Hospitals excluded from sewershed analysis",
       subtitle = paste0(nrow(map_excluded_sf), " facilities outside any monitored sewershed"))+
  theme(axis.text = element_blank(), axis.ticks = element_blank(), panel.grid = element_blank())

Show code

rm(lu_sewer)

Part 4. Aggregating Weekly and Merging Inputs

Show code

week_start_dow <- 1  # Monday start
window_model_start <- "2023-01-01"
window_model_end <- "2024-12-31"

4.1. Assessing Data Distribution Prior to Aggregation

4.1.1. Visual Inspection of Distribution of Data Across Weekdays

Show code

ww <- lu_run$a3_data[[which(lu_run$run_id == "waste")]]

qa_ww_cadence <- ww %>%   # swap in whatever your post-filter WW df is called
  filter(!is.na(vp_conc)) %>%
  mutate(
    weekday = wday(v_date, label = TRUE, week_start = 1),  # Mon-start, ordered
    year    = year(v_date)
  )

ggplot(qa_ww_cadence, aes(x = weekday)) +
  geom_bar() +
  facet_wrap(~ year, ncol = 2) +
  labs(
    title    = "WW sampling cadence by weekday",
    subtitle = "Populated v_conc records, faceted by calendar year",
    x = NULL, y = "Records"
  ) +
  theme_minimal()

Show code

rm(ww)

4.1.2. Visual Inspection of Distribution of Data Across Weeks

Show code

qa_ww_cadence %>%
  mutate(iso_week = floor_date(v_date, "week", week_start = 1)) %>%
  count(iso_week) %>%
  ggplot(aes(iso_week, n)) +
  geom_col() +
  labs(title = "Total WW samples per ISO week",
       x = NULL, y = "Samples") +
  theme_minimal()

Show code

qa_ww_cadence %>%
  mutate(iso_week = floor_date(v_date, "week", week_start = 1)) %>%
  count(v_epaid, iso_week) %>%
  group_by(v_epaid) %>%
  mutate(total_n = sum(n)) %>%
  ungroup() %>%
  mutate(v_epaid = forcats::fct_reorder(v_epaid, total_n)) %>%
  ggplot(aes(iso_week, v_epaid, fill = n)) +
  geom_tile() +
  scale_fill_viridis_c() +
  labs(title = "WW sampling coverage by sewershed × week",
       subtitle = "Sorted by total sample count",
       x = NULL, y = "Sewershed") +
  theme_minimal() +
  theme(axis.text.y = element_blank())

4.2 Aggregating Weekly-Level Hospital Data

Show code

hosp <- lu_run$a3_data[[which(lu_run$run_id == "hosp")]]

# Two-step because admissions is a flow but census/ICU are stocks
# week_start set in params at beginning.
hosp_weekly <- hosp |>
  mutate(week = floor_date(v_date, "week", week_start = week_start_dow)
         ) |>
  # Step 1: per facility per week — sum admissions (flow), mean census/ICU (stock)
  group_by(sw_id, v_pfi, week) |>
  summarize(
    admissions_fac = sum(v_admissions_totalnew, na.rm = TRUE),
    census_fac     = mean(v_currentpatients,    na.rm = TRUE),
    icu_fac        = mean(v_patients_icu,       na.rm = TRUE),
    .groups = "drop"
  ) |>
  # Step 2: roll facilities up to sewershed-week — sum across facilities
  group_by(sw_id, week) |>
  summarize(
    v_admissions_totalnew = sum(admissions_fac, na.rm = TRUE),
    v_currentpatients     = sum(census_fac,     na.rm = TRUE),
    v_patients_icu        = sum(icu_fac,        na.rm = TRUE),
    .groups = "drop"
  )

rm(hosp)

head(hosp_weekly) |>
  kable()

sw_id	week	v_admissions_totalnew	v_currentpatients	v_patients_icu
36001NY0026867AL11	2020-03-23	0	26.25000	11.00000
36001NY0026867AL11	2020-03-30	0	71.00000	29.00000
36001NY0026867AL11	2020-04-06	0	115.14286	55.28571
36001NY0026867AL11	2020-04-13	0	93.42857	40.57143
36001NY0026867AL11	2020-04-20	0	85.71429	30.00000
36001NY0026867AL11	2020-04-27	8	84.57143	31.00000

4.3 Aggregating Weekly-Level WW Data

Show code

waste <- lu_run$a3_data[[which(lu_run$run_id == "waste")]]

# ---- WW weekly aggregation ----
waste_weekly <- waste |>
  mutate(week = floor_date(v_date, "week", week_start = week_start_dow)) |>
  group_by(sw_id, week) |>
  summarize(
    vp_conc_log_mean = mean(vp_conc_log, na.rm = TRUE),
    n_samples        = n(),
    frac_detects     = mean(!flag_belowlod_derived, na.rm = TRUE),
    .groups = "drop"
  )

rm(waste)

head(waste_weekly) |>
  kable()

sw_id	week	vp_conc_log_mean	n_samples	frac_detects
36001NY0022225ACWWF	2023-02-06	11.080618	1	1
36001NY0022225ACWWF	2023-02-13	5.993961	1	0
36001NY0022225ACWWF	2023-02-20	11.522886	1	1
36001NY0022225ACWWF	2023-02-27	8.888895	1	1
36001NY0022225ACWWF	2023-03-06	10.555839	1	1
36001NY0022225ACWWF	2023-03-13	5.993961	1	0

4.4 Generate panel skeleton from sewershed lookup

Show code

lu_sewer <- dicts$lu_sewer

all_swsheds <- unique(lu_sewer$sw_id)  # adjust to your actual ID col
panel_window_start <- as.Date("2020-09-01")
panel_window_end   <- as.Date("2024-12-31")

all_weeks <- seq.Date(
  from = floor_date(panel_window_start, "week", week_start = week_start_dow),
  to   = floor_date(panel_window_end,   "week", week_start = week_start_dow),
  by   = "week"
)
panel_skeleton <- expand_grid(
  sw_id = all_swsheds,
  week = all_weeks
)
rm(lu_sewer)


head(panel_skeleton)

# A tibble: 6 × 2
  sw_id               week      
  <chr>               <date>    
1 36011NY0021903CCWW1 2020-08-31
2 36011NY0021903CCWW1 2020-09-07
3 36011NY0021903CCWW1 2020-09-14
4 36011NY0021903CCWW1 2020-09-21
5 36011NY0021903CCWW1 2020-09-28
6 36011NY0021903CCWW1 2020-10-05

4.5 Join weekly hospital and wwtp data to panel - Mask Missing vals

Show code

# ---- Join ----
panel <- panel_skeleton |>
  left_join(waste_weekly,   by = c("sw_id", "week")) |>
  left_join(hosp_weekly, by = c("sw_id", "week"))

panel <- panel |>
  mutate(ww_missing_mask = as.integer(is.na(vp_conc_log_mean)),
         hosp_missing_mask = as.integer(is.na(v_admissions_totalnew)),
)
 
rm(panel_skeleton, waste_weekly, hosp_weekly)

head(panel)

# A tibble: 6 × 10
  sw_id week       vp_conc_log_mean n_samples frac_detects v_admissions_totalnew
  <chr> <date>                <dbl>     <int>        <dbl>                 <dbl>
1 3601… 2020-08-31            NA           NA           NA                     0
2 3601… 2020-09-07            NA           NA           NA                     0
3 3601… 2020-09-14            NA           NA           NA                     1
4 3601… 2020-09-21            NA           NA           NA                     0
5 3601… 2020-09-28            NA           NA           NA                     3
6 3601… 2020-10-05             5.99         1            0                     1
# ℹ 4 more variables: v_currentpatients <dbl>, v_patients_icu <dbl>,
#   ww_missing_mask <int>, hosp_missing_mask <int>

4.6 Forward Fill

Show code

panel_filled <- panel |>
  arrange(sw_id, week) |>
  group_by(sw_id) |>
  fill(v_admissions_totalnew, v_currentpatients, v_patients_icu, vp_conc_log_mean,
       .direction = "down") |>
  ungroup()

4.7. Diagnostics and Visual

4.7.1. Full Period Coverage by Sewershed Summaries

Show code

# Per-sewershed coverage summary
qa_coverage_summary <- panel_filled |>
  group_by(sw_id) |>
  summarize(
    n_weeks            = n(),
    hosp_coverage_pct  = mean(hosp_missing_mask == 0) * 100,
    ww_coverage_pct    = mean(ww_missing_mask == 0)   * 100,
    has_any_ww         = any(ww_missing_mask == 0),
    .groups = "drop"
  ) |>
  arrange(desc(ww_coverage_pct))

thresholds <- c(0.5, 0.6, 0.7, 0.8, 0.9)
qa_threshold_table <- tibble(
  threshold = thresholds,
  hosp_only      = map_int(thresholds, ~ sum(qa_coverage_summary$hosp_coverage_pct >= .x * 100)),
  ww_only        = map_int(thresholds, ~ sum(qa_coverage_summary$ww_coverage_pct   >= .x * 100)),
  hosp_and_ww    = map_int(thresholds, ~ sum(qa_coverage_summary$hosp_coverage_pct >= .x * 100 &
                                              qa_coverage_summary$ww_coverage_pct  >= .x * 100))
)
qa_threshold_table |> kable()

threshold	hosp_only	ww_only	hosp_and_ww
0.5	108	90	54
0.6	108	52	32
0.7	108	18	9
0.8	108	10	5
0.9	108	8	5

Show code

ggplot(qa_coverage_summary, aes(hosp_coverage_pct, ww_coverage_pct)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 70, linetype = "dashed") +
  geom_vline(xintercept = 70, linetype = "dashed") +
  labs(x = "Hosp coverage %", y = "WW coverage %",
       title = "Per-sewershed coverage (each point = one sewershed), full analytical window")

4.7.2. Sewershed coerage by late/early window

Show code

windows <- tribble(
  ~window_label,         ~start,         ~end,
  "Early (2020-09 to 2021-09)", as.Date("2020-09-01"), as.Date("2021-09-01"),
  "Late (2023-01 to 2024-12)",  as.Date("2023-01-01"), as.Date("2024-12-31")
)

qa_coverage_by_window <- windows |>
  pmap_dfr(function(window_label, start, end) {
    panel_filled |>
      filter(week >= start, week <= end) |>
      group_by(sw_id) |>
      summarize(
        hosp_coverage_pct = mean(hosp_missing_mask == 0) * 100,
        ww_coverage_pct   = mean(ww_missing_mask == 0)   * 100,
        .groups = "drop"
      ) |>
      mutate(window_label = window_label)
  })

qa_threshold_table_window <- qa_coverage_by_window |>
  group_by(window_label) |>
  group_modify(~ tibble(
    threshold   = thresholds,
    hosp_only   = map_int(thresholds, \(t) sum(.x$hosp_coverage_pct >= t * 100)),
    ww_only     = map_int(thresholds, \(t) sum(.x$ww_coverage_pct   >= t * 100)),
    hosp_and_ww = map_int(thresholds, \(t) sum(.x$hosp_coverage_pct >= t * 100 &
                                               .x$ww_coverage_pct  >= t * 100))
  )) |>
  ungroup()

qa_threshold_table_window |> kable()

window_label	threshold	hosp_only	ww_only	hosp_and_ww
Early (2020-09 to 2021-09)	0.5	108	8	5
Early (2020-09 to 2021-09)	0.6	108	8	5
Early (2020-09 to 2021-09)	0.7	108	7	5
Early (2020-09 to 2021-09)	0.8	108	6	4
Early (2020-09 to 2021-09)	0.9	108	5	3
Late (2023-01 to 2024-12)	0.5	108	138	63
Late (2023-01 to 2024-12)	0.6	108	137	63
Late (2023-01 to 2024-12)	0.7	108	131	62
Late (2023-01 to 2024-12)	0.8	108	113	58
Late (2023-01 to 2024-12)	0.9	108	78	47

Show code

ggplot(qa_coverage_by_window, aes(hosp_coverage_pct, ww_coverage_pct)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 70, linetype = "dashed") +
  geom_vline(xintercept = 70, linetype = "dashed") +
  facet_wrap(~ window_label) +
  labs(x = "Hosp coverage %", y = "WW coverage %",
       title = "Per-sewershed coverage by analysis window")

Based on this plot, Model 1 and 2 will use all eligible sewersheds - filtering to hospitals which have at least 70% coverage within this window.

Model 3.1 and 3.2, which utilize the late window, will filter to sewersheds with both at least 70% WW coverage and at least 70% hospital coverage. M3.1 will be evaluated twice - once with the full list of all eligible sewersheds (at least 70% coverage of hospitals, no WW threshold), and once for apples to apples comparison with M3.2.

4.8. Export Filled Panel and Hospital + Hosp/WW Eligibility Sets

Confirmed that the 108 sewersheds present in the early dataset are the same present in the late dataset, so M1-M3.1.1 will be run on the same “hosp_only” eligibility set, then m3.1.2 and M2.2 will be run on the “hosp_and_ww”.

4.8.1 - Eligibility Lists Written as CSVs

Show code

### Set this threshold - can change but based on the abvoe data, 70% is the sweet spot.
threshold <- 0.7


### Generate hospital only list - set to be Early Window but it's the same in both windows, confirmed
hosp_only <-  qa_coverage_by_window |>
  filter(window_label == "Early (2020-09 to 2021-09)" & 
           hosp_coverage_pct >= threshold
           ) |>
  pull(sw_id)

## Write the CSV
write.csv(hosp_only, file = file.path("data/eligibility/hosp_only.csv"), col.names = FALSE, row.names = FALSE)


## generate late period 70% coverage eligibility lists
hosp_and_ww <-  qa_coverage_by_window |>
  filter(window_label == "Late (2023-01 to 2024-12)" & 
           hosp_coverage_pct >= threshold &
           ww_coverage_pct >= threshold
           ) |>
  pull(sw_id)


## write the CSV
write.csv(hosp_and_ww, file = file.path("data/eligibility/hosp_and_ww.csv"), col.names = FALSE, row.names = FALSE)

4.8.1 - Write Filled Panel as CSVs

Show code

## Write the CSV
write.csv(panel_filled, file = file.path("data/panel/panel_filled.csv"), row.names=FALSE)

Part 5. Transition to Python

5.1. Python Environment Setup

Using Reticulate to work in a Conda environment here that I named “covid-lstm”.

Import torch, pandas, numpy

Show code

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
print(torch.__version__)

2.10.0

Show code

print(pd.__version__)

3.0.2

Show code

print("CUDA available:" , torch.cuda.is_available())

CUDA available: True

Show code

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

5.2. Import Data for Modeling

5.2.1. Model Lookup

Show code

##Kable equivalent - passes into HTML
def scrollable(df, height=350):
    from IPython.display import HTML
    style = f"max-height: {height}px; overflow-y: auto; border: 1px solid #ddd;"
    return HTML(f'<div style="{style}">{df.to_html(index=False)}</div>')

lu_models = pd.read_csv("dictionaries/508_Final Project_models_AMossawir.csv",
                       parse_dates=["window_start", "window_end"])

scrollable(lu_models)

model_id	stage	description	window_start	window_end	features	eligibility_set	seq_len	hidden_size	num_layers	learning_rate	batch_size	epochs	embedding_dim	seed
M1	baseline	Hosp-only LSTM, early window, default hparams	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	12	32	1	0.0010	64	30	8	508
M2_01	sweep	Sweep: seq_len=4, hidden=16, lr=0.01	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	4	16	1	0.0100	64	30	8	508
M2_02	sweep	Sweep: seq_len=4, hidden=16, lr=0.001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	4	16	1	0.0010	64	30	8	508
M2_03	sweep	Sweep: seq_len=4, hidden=16, lr=0.0001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	4	16	1	0.0001	64	30	8	508
M2_04	sweep	Sweep: seq_len=4, hidden=32, lr=0.01	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	4	32	1	0.0100	64	30	8	508
M2_05	sweep	Sweep: seq_len=4, hidden=32, lr=0.001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	4	32	1	0.0010	64	30	8	508
M2_06	sweep	Sweep: seq_len=4, hidden=32, lr=0.0001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	4	32	1	0.0001	64	30	8	508
M2_07	sweep	Sweep: seq_len=4, hidden=64, lr=0.01	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	4	64	1	0.0100	64	30	8	508
M2_08	sweep	Sweep: seq_len=4, hidden=64, lr=0.001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	4	64	1	0.0010	64	30	8	508
M2_09	sweep	Sweep: seq_len=4, hidden=64, lr=0.0001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	4	64	1	0.0001	64	30	8	508
M2_10	sweep	Sweep: seq_len=8, hidden=16, lr=0.01	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	8	16	1	0.0100	64	30	8	508
M2_11	sweep	Sweep: seq_len=8, hidden=16, lr=0.001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	8	16	1	0.0010	64	30	8	508
M2_12	sweep	Sweep: seq_len=8, hidden=16, lr=0.0001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	8	16	1	0.0001	64	30	8	508
M2_13	sweep	Sweep: seq_len=8, hidden=32, lr=0.01	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	8	32	1	0.0100	64	30	8	508
M2_14	sweep	Sweep: seq_len=8, hidden=32, lr=0.001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	8	32	1	0.0010	64	30	8	508
M2_15	sweep	Sweep: seq_len=8, hidden=32, lr=0.0001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	8	32	1	0.0001	64	30	8	508
M2_16	sweep	Sweep: seq_len=8, hidden=64, lr=0.01	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	8	64	1	0.0100	64	30	8	508
M2_17	sweep	Sweep: seq_len=8, hidden=64, lr=0.001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	8	64	1	0.0010	64	30	8	508
M2_18	sweep	Sweep: seq_len=8, hidden=64, lr=0.0001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	8	64	1	0.0001	64	30	8	508
M2_19	sweep	Sweep: seq_len=12, hidden=16, lr=0.01	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	12	16	1	0.0100	64	30	8	508
M2_20	sweep	Sweep: seq_len=12, hidden=16, lr=0.001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	12	16	1	0.0010	64	30	8	508
M2_21	sweep	Sweep: seq_len=12, hidden=16, lr=0.0001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	12	16	1	0.0001	64	30	8	508
M2_22	sweep	Sweep: seq_len=12, hidden=32, lr=0.01	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	12	32	1	0.0100	64	30	8	508
M2_23	sweep	Sweep: seq_len=12, hidden=32, lr=0.001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	12	32	1	0.0010	64	30	8	508
M2_24	sweep	Sweep: seq_len=12, hidden=32, lr=0.0001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	12	32	1	0.0001	64	30	8	508
M2_25	sweep	Sweep: seq_len=12, hidden=64, lr=0.01	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	12	64	1	0.0100	64	30	8	508
M2_26	sweep	Sweep: seq_len=12, hidden=64, lr=0.001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	12	64	1	0.0010	64	30	8	508
M2_27	sweep	Sweep: seq_len=12, hidden=64, lr=0.0001	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	12	64	1	0.0001	64	30	8	508
M2	tuned	Hosp-only LSTM, early window, sweep winner (TRANSCRIBE FROM SWEEP)	2020-09-01	2021-09-01	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	8	32	1	0.0100	64	30	8	508
M3_1_1	transfer	Hosp-only LSTM, late window, hospital-full-subsets TBD	2023-01-01	2024-12-31	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_only	8	32	1	0.0100	64	30	8	508
M3_1_2	transfer	Hosp-only LSTM, late window, M2 hparams transferred (TRANSCRIBE FROM M2)	2023-01-01	2024-12-31	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask	hosp_and_ww	8	32	1	0.0100	64	30	8	508
M3_2	treatment	Hosp+WW LSTM, late window, M2 hparams transferred (TRANSCRIBE FROM M2)	2023-01-01	2024-12-31	v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask,vp_conc_log_mean,ww_missing_mask	hosp_and_ww	8	32	1	0.0100	64	30	8	508

5.2.2. Eligibility Lists

Show code

## Read both from CSSV
hosp_only = pd.read_csv("data/eligibility/hosp_only.csv")["x"].tolist()
hosp_and_ww = pd.read_csv("data/eligibility/hosp_and_ww.csv")["x"].tolist()

eligibility_lookup = {
    "hosp_only": hosp_only,
    "hosp_and_ww": hosp_and_ww,
}
print("Hospital Only Subset: ", len(hosp_only))

Hospital Only Subset:  108

Show code

print("Hospital and WW Subset: ", len(hosp_and_ww))

Hospital and WW Subset:  69

5.2.3. Panel

Show code

panel = pd.read_csv("data/panel/panel_filled.csv", parse_dates=["week"])

print("Panel Dimensions: ", panel.shape)

Panel Dimensions:  (135292, 10)

Part 6. Modeling Utilities

6.1. Tensor Construction (panel_to_tensors)

Show code

# Creating Time-Series Sequences
# Neural networks require fixed-size input samples.

# To convert the continuous time series into training samples, we use a sliding window approach.

# For each sample:
    #Input:
    #A window containing the past **72 time steps**.

    #Target:
    # The temperature value **6 time steps in the future**.

# This converts the dataset into many input–output pairs suitable for training neural networks.

## This takes panel, single row of lookup, eligibility_lookup
## Horizon = 2 because we're predicting t+2
## Test fract = 0.2, this needs to be handled here because we are dividing this up by date-range for prediction.

def panel_to_tensors(panel_df, config_row, eligibility_lookup, horizon=2, test_frac=0.2):
    feature_cols = config_row["features"].split(",")     ## Grab list of inputs, turn into list
    seq_len = int(config_row["seq_len"])      ## Sequence length is lookback window, analogous to window size
    window_start = pd.to_datetime(config_row["window_start"]) ## Filter model date window
    window_end = pd.to_datetime(config_row["window_end"])    ## filter model date windo
    eligible_sw_ids = eligibility_lookup[config_row["eligibility_set"]]   ## filter this to only the relevant SWs
# 2. Filter panel to eligible sewersheds and analysis window
    p = panel_df[
        panel_df["sw_id"].isin(eligible_sw_ids) &
        (panel_df["week"] >= window_start) &
        (panel_df["week"] <= window_end)
    ].copy()
    p = p.sort_values(["sw_id", "week"])
    
    # 3. Backfill leading NAs in vp_conc_log_mean per sewershed (mask flags imputed cells)
    if "vp_conc_log_mean" in feature_cols:
        p["vp_conc_log_mean"] = (
            p.groupby("sw_id", group_keys=False)["vp_conc_log_mean"].bfill()
        )
    
    # 4. Build sw_id → integer index mapping for embedding layer
    sw_id_to_int = {sw: i for i, sw in enumerate(sorted(p["sw_id"].unique()))}
    
    # 5. Determine chronological train/test cutoff (same date for all sewersheds)
    all_weeks = sorted(p["week"].unique())
    cutoff_idx = int(len(all_weeks) * (1 - test_frac))
    cutoff_date = all_weeks[cutoff_idx]
    
    # 6. Per-sewershed sliding window: build sample triples - This is the same as create_sequence but has sw embed
    X_train, y_train, idx_train = [], [], []
    X_test, y_test, idx_test = [], [], []
    
    ## Looping over each sewershed - SW-loop features
    for sw_id, group in p.groupby("sw_id"):
        group = group.sort_values("week").reset_index(drop=True)
        features = group[feature_cols].values.astype(np.float32)
        ## target always v_admissions
        target = group["v_admissions_totalnew"].values.astype(np.float32)
        weeks = group["week"].values
        
        # Slide window: input = features[i : i+seq_len], target = log(target[i+seq_len+horizon-1])
        for i in range(len(group) - seq_len - horizon + 1):
            X_window = features[i : i + seq_len]
            ## this is because python >:|, -1 to get this to ex. 6 + 2 week window, 9 - 1 because 0 = 1
            y_value = target[i + seq_len + horizon - 1]
            ## this is because python >:|
            target_week = weeks[i + seq_len + horizon - 1]
            
            # Skip if target is NaN (shouldn't happen for hosp, can happen if WW row had no fill possible)
            if np.isnan(y_value) or np.isnan(X_window).any():
                continue
            
            # Log-transform target - count data, heavily skewed
            ## I am not doing poisson even though it's a better match because lab doesn't
            y_log = np.log1p(y_value)
            ## grab sewershed specificity for embedding
            sw_idx = sw_id_to_int[sw_id]
            
            # Assign to train or test by target week
            if target_week < cutoff_date:
                X_train.append(X_window)
                y_train.append(y_log)
                idx_train.append(sw_idx)
            else:
                X_test.append(X_window)
                y_test.append(y_log)
                idx_test.append(sw_idx)
    
    # 7. Stack into torch tensors
    def stack(X_list, y_list, idx_list):
        return (
            torch.tensor(np.array(X_list), dtype=torch.float32),
            torch.tensor(np.array(y_list), dtype=torch.float32),
            torch.tensor(np.array(idx_list), dtype=torch.long),
        )
    
    tensors = {
        "train": stack(X_train, y_train, idx_train),
        "test": stack(X_test, y_test, idx_test),
    }
    
    return tensors, sw_id_to_int

tensors, sw_id_to_int = panel_to_tensors(panel, lu_models.iloc[0], eligibility_lookup)
print("Train X shape:", tensors["train"][0].shape)

Train X shape: torch.Size([3024, 12, 4])

Show code

print("Test X shape:", tensors["test"][0].shape)

Test X shape: torch.Size([1188, 12, 4])

Show code

print("N sewersheds:", len(sw_id_to_int))

N sewersheds: 108

Train + Test = 3456 + 1188 = 4644 108 x (52-8 - 2 + 1(ew))

6.2. Sequence Dataset

Show code

## This is the same as what we use in seq pred lab but include added sw embed
## doesn't need tensor thing because did it before

from torch.utils.data import Dataset, DataLoader

class SequenceDataset(Dataset):
    def __init__(self, X, y, sw_idx):
        self.X = X
        self.y = y
        self.sw_idx = sw_idx

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx], self.sw_idx[idx]
      
X_train, y_train, idx_train = tensors["train"]
ds = SequenceDataset(X_train, y_train, idx_train)
print("len:", len(ds))

len: 3024

Show code

print("first sample shapes:", [t.shape for t in ds[0]])

first sample shapes: [torch.Size([12, 4]), torch.Size([]), torch.Size([])]

6.3. Model architecture (LSTMRegressor)

Show code

# LSTM Model Architecture: https://docs.pytorch.org/docs/stable/generated/torch.nn.LSTM.html
class LSTMRegressor(nn.Module):
    def __init__(self, input_size, hidden_size, n_sewersheds, num_layers, embedding_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_size=input_size,
                            hidden_size=hidden_size,
                            num_layers=num_layers,
                            batch_first=True)
        # add this to embed sewershed
        self.embedding = nn.Embedding(n_sewersheds, embedding_dim)
        # add embedding dimension
        self.fc = nn.Linear(hidden_size + embedding_dim, 1)

    def forward(self, x, sw_idx):
        out, (h, c) = self.lstm(x)
        last_hidden = out[:, -1, :]
        emb = self.embedding(sw_idx)
        concat = torch.cat([last_hidden, emb], dim=1)
        return self.fc(concat)
      
      
      

config = lu_models.iloc[0]

n_features    = len(config["features"].split(","))
hidden_size   = int(config["hidden_size"])
num_layers    = int(config["num_layers"])
embedding_dim = int(config["embedding_dim"])
n_sewersheds  = len(sw_id_to_int)

model = LSTMRegressor(input_size=n_features, hidden_size=hidden_size,  n_sewersheds=n_sewersheds, embedding_dim=embedding_dim, num_layers=num_layers)

fake_X = torch.zeros(2, int(config["seq_len"]), n_features)
fake_idx = torch.tensor([0, 1], dtype=torch.long)
out = model(fake_X, fake_idx)
print("Output shape:", out.shape)

Output shape: torch.Size([2, 1])

Show code

print("Output values:", out)

Output values: tensor([[-0.4940],
        [-0.5252]], grad_fn=<AddmmBackward0>)

6.4. Training and evaluation helpers

6.4.1. Train One Epoch

Lab function + sw_idx

Show code


def train_one_epoch(model, loader, criterion, optimizer):
    model.train()
    total_loss = 0.0
    ## Added sw_batch
    for X_batch, y_batch, sw_batch in loader:
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)
        ## Added sw_batch
        sw_batch = sw_batch.to(device)

        optimizer.zero_grad()
        ## Added sw_batch - with squeeze because got shape error
        preds = model(X_batch, sw_batch).squeeze(-1)
        loss = criterion(preds, y_batch)
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * X_batch.size(0)

    return total_loss / len(loader.dataset)

6.4.2. Evaluate Loss

Lab function + sw_idx

Show code


@torch.no_grad()
def evaluate_loss(model, loader, criterion):
    model.eval()
    total_loss = 0.0

    for X_batch, y_batch, sw_batch in loader:
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)
        sw_batch = sw_batch.to(device)

        preds = model(X_batch, sw_batch).squeeze(-1)
        loss = criterion(preds, y_batch)

        total_loss += loss.item() * X_batch.size(0)

    return total_loss / len(loader.dataset)

6.4.3. Collect predictions

Lab function + sw_idx

Show code


@torch.no_grad()
def collect_predictions(model, loader):
    model.eval()
    preds_all = []
    targets_all = []
    sw_all = []

    for X_batch, y_batch, sw_batch in loader:
        X_batch = X_batch.to(device)
        sw_batch = sw_batch.to(device)
        preds = model(X_batch, sw_batch).squeeze(-1).cpu().numpy()
        targets = y_batch.numpy()
        preds_all.append(preds)
        targets_all.append(targets)
        sw_all.append(sw_batch.cpu().numpy())

    preds_all = np.concatenate(preds_all)
    targets_all = np.concatenate(targets_all)
    # back-transform from log scale to counts
    preds_all = np.expm1(preds_all)
    targets_all = np.expm1(targets_all)
    sw_all = np.concatenate(sw_all)

    return preds_all, targets_all, sw_all

6.4.4. Plot learning Cureve

Show code


def plot_learning_curve(history, model_name):
    epochs_range = range(1, len(history["train_loss"]) + 1)
    plt.figure(figsize=(7, 4))
    plt.plot(epochs_range, history["train_loss"], marker="o", label="Train Loss")
    plt.plot(epochs_range, history["test_loss"],  marker="s", label="Test Loss")
    plt.xlabel("Epoch")
    plt.ylabel("MSE Loss (log scale)")
    plt.title(f"{model_name}: Train vs Test Loss")
    plt.legend()
    plt.grid(True, linestyle="--", alpha=0.4)
    plt.show()

6.4.5. Train Model

Main adjustment from lab was swapping everything to be driven by the lookup row

Show code


def train_model(panel, config_row, eligibility_lookup, show_plot = True, verbose=True):

    # 1. Pull hparams from config row
    model_id     = config_row["model_id"]
    hidden_size  = int(config_row["hidden_size"])
    num_layers   = int(config_row["num_layers"])
    embedding_dim = int(config_row["embedding_dim"])
    batch_size   = int(config_row["batch_size"])
    epochs       = int(config_row["epochs"])
    lr           = float(config_row["learning_rate"])
    seed         = int(config_row["seed"])

    # 2. Reproducibility
    torch.manual_seed(seed)
    np.random.seed(seed)

    # 3. Build tensors via calling panel to tensors
    tensors, sw_id_to_int = panel_to_tensors(panel, config_row, eligibility_lookup)
    X_train, y_train, idx_train = tensors["train"]
    X_test,  y_test,  idx_test  = tensors["test"]
    n_features = X_train.shape[2]
    n_sewersheds = len(sw_id_to_int)
    
    # 4. Build datasets and loaders via sequence dataset
    train_ds = SequenceDataset(X_train, y_train, idx_train)
    test_ds  = SequenceDataset(X_test,  y_test,  idx_test)
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    test_loader  = DataLoader(test_ds,  batch_size=batch_size, shuffle=False)

     # 5. Build model via LSTM Regressor
    model = LSTMRegressor(
        input_size=n_features,
        hidden_size=hidden_size,
        n_sewersheds=n_sewersheds,
        num_layers=num_layers,
        embedding_dim=embedding_dim,
    ).to(device)
    
    # 6. Loss + optimizer
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    
    # 7. Training loop — uses Step D helpers
    history = {"train_loss": [], "test_loss": []}
    best_test_loss = float("inf")
    best_state = None
    
    for epoch in range(1, epochs + 1):
        train_loss = train_one_epoch(model, train_loader, criterion, optimizer)
        test_loss  = evaluate_loss(model, test_loader, criterion)
        history["train_loss"].append(train_loss)
        history["test_loss"].append(test_loss)
        if test_loss < best_test_loss:
            best_test_loss = test_loss
            best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
        if verbose:
            print(f"{model_id} | Epoch {epoch}/{epochs} | Train: {train_loss:.4f} | Test: {test_loss:.4f}")

    # 8. Restore best state, get predictions
    model.load_state_dict(best_state)
    preds, targets, sw_idx = collect_predictions(model, test_loader)
    
    # 9. Compute metrics on count scale
    mae  = float(np.mean(np.abs(preds - targets)))
    rmse = float(np.sqrt(np.mean((preds - targets) ** 2)))
    
    # 10. Return everything
    
    if show_plot:
        plot_learning_curve(history, model_id)
    return {
        "model_id":     model_id,
        "model":        model,
        "history":      history,
        "preds":        preds,
        "targets":      targets,
        "sw_idx":       sw_idx,
        "sw_id_to_int": sw_id_to_int,
        "mae":          mae,
        "rmse":         rmse,
    }

Part 7. Modeling

7.1. Model 1 - Baseline

Covid prediction off of hospital data only - early period where there is more data.

Show code

result_m1 = train_model(panel, lu_models.iloc[0], eligibility_lookup)

M1 | Epoch 1/30 | Train: 4.0112 | Test: 0.8657
M1 | Epoch 2/30 | Train: 1.1826 | Test: 0.5648
M1 | Epoch 3/30 | Train: 0.5845 | Test: 0.5041
M1 | Epoch 4/30 | Train: 0.4317 | Test: 0.4880
M1 | Epoch 5/30 | Train: 0.3595 | Test: 0.4756
M1 | Epoch 6/30 | Train: 0.3205 | Test: 0.4776
M1 | Epoch 7/30 | Train: 0.2986 | Test: 0.4594
M1 | Epoch 8/30 | Train: 0.2862 | Test: 0.4534
M1 | Epoch 9/30 | Train: 0.2767 | Test: 0.4667
M1 | Epoch 10/30 | Train: 0.2699 | Test: 0.4514
M1 | Epoch 11/30 | Train: 0.2661 | Test: 0.4473
M1 | Epoch 12/30 | Train: 0.2621 | Test: 0.4557
M1 | Epoch 13/30 | Train: 0.2611 | Test: 0.4556
M1 | Epoch 14/30 | Train: 0.2597 | Test: 0.4657
M1 | Epoch 15/30 | Train: 0.2563 | Test: 0.4524
M1 | Epoch 16/30 | Train: 0.2534 | Test: 0.4540
M1 | Epoch 17/30 | Train: 0.2504 | Test: 0.4810
M1 | Epoch 18/30 | Train: 0.2501 | Test: 0.4885
M1 | Epoch 19/30 | Train: 0.2484 | Test: 0.4824
M1 | Epoch 20/30 | Train: 0.2457 | Test: 0.4814
M1 | Epoch 21/30 | Train: 0.2434 | Test: 0.5054
M1 | Epoch 22/30 | Train: 0.2442 | Test: 0.5047
M1 | Epoch 23/30 | Train: 0.2426 | Test: 0.4774
M1 | Epoch 24/30 | Train: 0.2401 | Test: 0.4661
M1 | Epoch 25/30 | Train: 0.2390 | Test: 0.4968
M1 | Epoch 26/30 | Train: 0.2402 | Test: 0.4904
M1 | Epoch 27/30 | Train: 0.2370 | Test: 0.4722
M1 | Epoch 28/30 | Train: 0.2387 | Test: 0.4849
M1 | Epoch 29/30 | Train: 0.2344 | Test: 0.4853
M1 | Epoch 30/30 | Train: 0.2346 | Test: 0.5079

Show code

print("\n=== M1 final ===")


=== M1 final ===

Show code

print(f"MAE:  {result_m1['mae']:.3f} admissions")

MAE:  4.700 admissions

Show code

print(f"RMSE: {result_m1['rmse']:.3f} admissions")

RMSE: 9.945 admissions

Show code

print(f"Best test loss (log MSE): {min(result_m1['history']['test_loss']):.4f}")

Best test loss (log MSE): 0.4473

Show code

plot_learning_curve(result_m1["history"], "M1")

7.2 M2 — hyperparameter sweep

In order to find the best hyperparameters for this model, we will loop through the “sweep” section of the lookup.

Show code

sweep_indices = lu_models[lu_models["stage"] == "sweep"].index.tolist()

print(sweep_indices)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]

Show code

sweep_results = []   # initialize
for idx in sweep_indices:
    config_row = lu_models.loc[idx]
    result = train_model(panel, config_row, eligibility_lookup, show_plot=False, verbose=False)
    sweep_results.append({
        "model_id":       result["model_id"],
        "seq_len":        int(config_row["seq_len"]),
        "hidden_size":    int(config_row["hidden_size"]),
        "learning_rate":  float(config_row["learning_rate"]),
        "mae":            result["mae"],
        "rmse":           result["rmse"],
        "best_test_loss": min(result["history"]["test_loss"]),
    })

sweep_df = pd.DataFrame(sweep_results)

## Sort by RMSE to get best fit
scrollable(sweep_df.sort_values("rmse").reset_index(drop=True))

model_id	seq_len	hidden_size	learning_rate	mae	rmse	best_test_loss
M2_07	4	64	0.0100	3.855698	7.634340	0.353938
M2_13	8	32	0.0100	4.171322	8.319045	0.388757
M2_04	4	32	0.0100	4.153130	8.767463	0.357772
M2_08	4	64	0.0010	4.318702	8.947997	0.388947
M2_18	8	64	0.0001	4.480975	9.049541	0.422520
M2_09	4	64	0.0001	4.428402	9.138783	0.408503
M2_01	4	16	0.0100	4.314425	9.192862	0.370301
M2_17	8	64	0.0010	4.515060	9.391414	0.404783
M2_11	8	16	0.0010	4.531479	9.445838	0.389287
M2_05	4	32	0.0010	4.422468	9.483552	0.393156
M2_25	12	64	0.0100	4.576123	9.505790	0.408910
M2_26	12	64	0.0010	4.615394	9.602592	0.423156
M2_27	12	64	0.0001	4.722316	9.676497	0.470032
M2_02	4	16	0.0010	4.474056	9.722293	0.390573
M2_14	8	32	0.0010	4.610190	9.852764	0.403625
M2_10	8	16	0.0100	4.404883	9.866540	0.387222
M2_06	4	32	0.0001	4.795813	9.906642	0.440181
M2_23	12	32	0.0010	4.700381	9.944742	0.447316
M2_20	12	16	0.0010	4.729808	10.052602	0.431354
M2_19	12	16	0.0100	4.748424	10.224441	0.405565
M2_15	8	32	0.0001	4.940846	10.311592	0.443992
M2_16	8	64	0.0100	4.549072	10.457841	0.380847
M2_22	12	32	0.0100	4.858771	10.999123	0.418311
M2_24	12	32	0.0001	5.300166	11.588200	0.484095
M2_03	4	16	0.0001	7.211026	16.825611	0.714947
M2_12	8	16	0.0001	7.483019	17.326490	0.790140
M2_21	12	16	0.0001	7.787136	17.897711	0.896242

Based on the results of the hyperparameter sweep, I chose M2_13 for the optimized run.

The items with a Seq Len of 12 performed consistently poorly - this makes sense given the limited data. - seq_len = 8 - hidden_size = 32 - learning_rate = 0.01

I have input these back into the lookup for Tuned M2 results, M3_1_1, M3_1_2, M3_2.

7.3. - Final Output of M2

Show code

config_m2 = lu_models[lu_models["model_id"] == "M2"].iloc[0]
result_m2 = train_model(panel, config_m2, eligibility_lookup)

M2 | Epoch 1/30 | Train: 1.0215 | Test: 0.4522
M2 | Epoch 2/30 | Train: 0.3378 | Test: 0.3991
M2 | Epoch 3/30 | Train: 0.3321 | Test: 0.4200
M2 | Epoch 4/30 | Train: 0.3272 | Test: 0.3888
M2 | Epoch 5/30 | Train: 0.3209 | Test: 0.4225
M2 | Epoch 6/30 | Train: 0.3089 | Test: 0.4147
M2 | Epoch 7/30 | Train: 0.3025 | Test: 0.4045
M2 | Epoch 8/30 | Train: 0.2997 | Test: 0.4169
M2 | Epoch 9/30 | Train: 0.2893 | Test: 0.4547
M2 | Epoch 10/30 | Train: 0.2846 | Test: 0.3928
M2 | Epoch 11/30 | Train: 0.2761 | Test: 0.4003
M2 | Epoch 12/30 | Train: 0.2747 | Test: 0.4311
M2 | Epoch 13/30 | Train: 0.2672 | Test: 0.3999
M2 | Epoch 14/30 | Train: 0.2645 | Test: 0.4476
M2 | Epoch 15/30 | Train: 0.2632 | Test: 0.4535
M2 | Epoch 16/30 | Train: 0.2597 | Test: 0.4288
M2 | Epoch 17/30 | Train: 0.2661 | Test: 0.3972
M2 | Epoch 18/30 | Train: 0.2613 | Test: 0.4126
M2 | Epoch 19/30 | Train: 0.2597 | Test: 0.4450
M2 | Epoch 20/30 | Train: 0.2518 | Test: 0.4537
M2 | Epoch 21/30 | Train: 0.2497 | Test: 0.4512
M2 | Epoch 22/30 | Train: 0.2451 | Test: 0.4356
M2 | Epoch 23/30 | Train: 0.2430 | Test: 0.4237
M2 | Epoch 24/30 | Train: 0.2421 | Test: 0.4238
M2 | Epoch 25/30 | Train: 0.2359 | Test: 0.4347
M2 | Epoch 26/30 | Train: 0.2317 | Test: 0.4218
M2 | Epoch 27/30 | Train: 0.2329 | Test: 0.4421
M2 | Epoch 28/30 | Train: 0.2315 | Test: 0.4189
M2 | Epoch 29/30 | Train: 0.2280 | Test: 0.4545
M2 | Epoch 30/30 | Train: 0.2243 | Test: 0.4574

Show code

print("\n=== M2 final ===")


=== M2 final ===

Show code

print(f"M2 final | MAE: {result_m2['mae']:.3f} | RMSE: {result_m2['rmse']:.3f}")

M2 final | MAE: 4.171 | RMSE: 8.319

Show code

plot_learning_curve(result_m2["history"], "M2")

7.4. M3.1.1 — Late-window transfer - Retain full Hospital list

This model will compare the results of the early window to the results of the full window, using the same hyperparameters.

Show code

config_m3_1_1 = lu_models[lu_models["model_id"] == "M3_1_1"].iloc[0]
result_m3_1_1 = train_model(panel, config_m3_1_1, eligibility_lookup)

M3_1_1 | Epoch 1/30 | Train: 0.4323 | Test: 0.3371
M3_1_1 | Epoch 2/30 | Train: 0.3184 | Test: 0.3282
M3_1_1 | Epoch 3/30 | Train: 0.3184 | Test: 0.3345
M3_1_1 | Epoch 4/30 | Train: 0.3046 | Test: 0.3139
M3_1_1 | Epoch 5/30 | Train: 0.2969 | Test: 0.3193
M3_1_1 | Epoch 6/30 | Train: 0.2912 | Test: 0.3062
M3_1_1 | Epoch 7/30 | Train: 0.2905 | Test: 0.3447
M3_1_1 | Epoch 8/30 | Train: 0.2915 | Test: 0.3074
M3_1_1 | Epoch 9/30 | Train: 0.2865 | Test: 0.3076
M3_1_1 | Epoch 10/30 | Train: 0.2885 | Test: 0.3012
M3_1_1 | Epoch 11/30 | Train: 0.2881 | Test: 0.3042
M3_1_1 | Epoch 12/30 | Train: 0.2881 | Test: 0.3124
M3_1_1 | Epoch 13/30 | Train: 0.2903 | Test: 0.3114
M3_1_1 | Epoch 14/30 | Train: 0.2861 | Test: 0.3017
M3_1_1 | Epoch 15/30 | Train: 0.2837 | Test: 0.3050
M3_1_1 | Epoch 16/30 | Train: 0.2875 | Test: 0.3148
M3_1_1 | Epoch 17/30 | Train: 0.2822 | Test: 0.3070
M3_1_1 | Epoch 18/30 | Train: 0.2846 | Test: 0.3022
M3_1_1 | Epoch 19/30 | Train: 0.2842 | Test: 0.3412
M3_1_1 | Epoch 20/30 | Train: 0.2825 | Test: 0.3027
M3_1_1 | Epoch 21/30 | Train: 0.2831 | Test: 0.3099
M3_1_1 | Epoch 22/30 | Train: 0.2837 | Test: 0.3149
M3_1_1 | Epoch 23/30 | Train: 0.2793 | Test: 0.3057
M3_1_1 | Epoch 24/30 | Train: 0.2808 | Test: 0.3031
M3_1_1 | Epoch 25/30 | Train: 0.2813 | Test: 0.3137
M3_1_1 | Epoch 26/30 | Train: 0.2796 | Test: 0.3144
M3_1_1 | Epoch 27/30 | Train: 0.2755 | Test: 0.3136
M3_1_1 | Epoch 28/30 | Train: 0.2781 | Test: 0.3154
M3_1_1 | Epoch 29/30 | Train: 0.2787 | Test: 0.3131
M3_1_1 | Epoch 30/30 | Train: 0.2786 | Test: 0.3146

Show code

print("\n=== M3.1.1 final ===")


=== M3.1.1 final ===

Show code

print(f"M3.1.1 final | MAE: {result_m3_1_1['mae']:.3f} | RMSE: {result_m3_1_1['rmse']:.3f}")

M3.1.1 final | MAE: 3.021 | RMSE: 6.062

Show code

plot_learning_curve(result_m3_1_1["history"], "M3.1 - All Hospitals")

7.5. M3.1.2 — Late-window transfer - Reduced Hospital List

This model is trained off less data, for the full range. Expect to see loss of predictive value. Hopefully able to introduce more data later.

Show code

config_m3_1_2 = lu_models[lu_models["model_id"] == "M3_1_2"].iloc[0]
result_m3_1_2 = train_model(panel, config_m3_1_2, eligibility_lookup)

M3_1_2 | Epoch 1/30 | Train: 0.5007 | Test: 0.3424
M3_1_2 | Epoch 2/30 | Train: 0.3281 | Test: 0.3321
M3_1_2 | Epoch 3/30 | Train: 0.3157 | Test: 0.3209
M3_1_2 | Epoch 4/30 | Train: 0.3072 | Test: 0.3139
M3_1_2 | Epoch 5/30 | Train: 0.3058 | Test: 0.3115
M3_1_2 | Epoch 6/30 | Train: 0.2996 | Test: 0.3053
M3_1_2 | Epoch 7/30 | Train: 0.2975 | Test: 0.3115
M3_1_2 | Epoch 8/30 | Train: 0.3000 | Test: 0.3049
M3_1_2 | Epoch 9/30 | Train: 0.2975 | Test: 0.3162
M3_1_2 | Epoch 10/30 | Train: 0.2963 | Test: 0.3060
M3_1_2 | Epoch 11/30 | Train: 0.2940 | Test: 0.3095
M3_1_2 | Epoch 12/30 | Train: 0.2951 | Test: 0.3098
M3_1_2 | Epoch 13/30 | Train: 0.2940 | Test: 0.3049
M3_1_2 | Epoch 14/30 | Train: 0.2937 | Test: 0.3012
M3_1_2 | Epoch 15/30 | Train: 0.2942 | Test: 0.3100
M3_1_2 | Epoch 16/30 | Train: 0.2910 | Test: 0.2991
M3_1_2 | Epoch 17/30 | Train: 0.2939 | Test: 0.2990
M3_1_2 | Epoch 18/30 | Train: 0.2891 | Test: 0.2999
M3_1_2 | Epoch 19/30 | Train: 0.2890 | Test: 0.3012
M3_1_2 | Epoch 20/30 | Train: 0.2918 | Test: 0.3024
M3_1_2 | Epoch 21/30 | Train: 0.2901 | Test: 0.3017
M3_1_2 | Epoch 22/30 | Train: 0.2875 | Test: 0.3066
M3_1_2 | Epoch 23/30 | Train: 0.2879 | Test: 0.3037
M3_1_2 | Epoch 24/30 | Train: 0.2871 | Test: 0.3084
M3_1_2 | Epoch 25/30 | Train: 0.2863 | Test: 0.3089
M3_1_2 | Epoch 26/30 | Train: 0.2847 | Test: 0.3111
M3_1_2 | Epoch 27/30 | Train: 0.2821 | Test: 0.3106
M3_1_2 | Epoch 28/30 | Train: 0.2815 | Test: 0.3158
M3_1_2 | Epoch 29/30 | Train: 0.2807 | Test: 0.3125
M3_1_2 | Epoch 30/30 | Train: 0.2796 | Test: 0.3228

Show code

print("\n=== M3.1.2 final ===")


=== M3.1.2 final ===

Show code

print(f"M3.1.2 final | MAE: {result_m3_1_2['mae']:.3f} | RMSE: {result_m3_1_2['rmse']:.3f}")

M3.1.2 final | MAE: 2.496 | RMSE: 5.132

Show code

plot_learning_curve(result_m3_1_2["history"], "M3.1 - Limited Hospital List")

7.6. M3.2 — Late-window data - reduced hospital list + Wastewater Signal

Show code

config_m3_2 = lu_models[lu_models["model_id"] == "M3_2"].iloc[0]
result_m3_2 = train_model(panel, config_m3_2, eligibility_lookup)

M3_2 | Epoch 1/30 | Train: 0.4871 | Test: 0.3684
M3_2 | Epoch 2/30 | Train: 0.3318 | Test: 0.3505
M3_2 | Epoch 3/30 | Train: 0.3230 | Test: 0.3413
M3_2 | Epoch 4/30 | Train: 0.3139 | Test: 0.3379
M3_2 | Epoch 5/30 | Train: 0.3060 | Test: 0.3189
M3_2 | Epoch 6/30 | Train: 0.2968 | Test: 0.3076
M3_2 | Epoch 7/30 | Train: 0.2948 | Test: 0.3106
M3_2 | Epoch 8/30 | Train: 0.2920 | Test: 0.3013
M3_2 | Epoch 9/30 | Train: 0.2862 | Test: 0.3045
M3_2 | Epoch 10/30 | Train: 0.2886 | Test: 0.2970
M3_2 | Epoch 11/30 | Train: 0.2855 | Test: 0.3157
M3_2 | Epoch 12/30 | Train: 0.2841 | Test: 0.2957
M3_2 | Epoch 13/30 | Train: 0.2843 | Test: 0.3113
M3_2 | Epoch 14/30 | Train: 0.2847 | Test: 0.3116
M3_2 | Epoch 15/30 | Train: 0.2859 | Test: 0.2968
M3_2 | Epoch 16/30 | Train: 0.2834 | Test: 0.3089
M3_2 | Epoch 17/30 | Train: 0.2796 | Test: 0.3049
M3_2 | Epoch 18/30 | Train: 0.2791 | Test: 0.2999
M3_2 | Epoch 19/30 | Train: 0.2813 | Test: 0.2997
M3_2 | Epoch 20/30 | Train: 0.2805 | Test: 0.3172
M3_2 | Epoch 21/30 | Train: 0.2810 | Test: 0.2976
M3_2 | Epoch 22/30 | Train: 0.2820 | Test: 0.3060
M3_2 | Epoch 23/30 | Train: 0.2756 | Test: 0.3012
M3_2 | Epoch 24/30 | Train: 0.2738 | Test: 0.3113
M3_2 | Epoch 25/30 | Train: 0.2763 | Test: 0.3222
M3_2 | Epoch 26/30 | Train: 0.2745 | Test: 0.3005
M3_2 | Epoch 27/30 | Train: 0.2808 | Test: 0.3075
M3_2 | Epoch 28/30 | Train: 0.2788 | Test: 0.3170
M3_2 | Epoch 29/30 | Train: 0.2700 | Test: 0.3057
M3_2 | Epoch 30/30 | Train: 0.2681 | Test: 0.3168

Show code

print("\n=== M3.2 final ===")


=== M3.2 final ===

Show code

print(f"M3.2 final | MAE: {result_m3_2['mae']:.3f} | RMSE: {result_m3_2['rmse']:.3f}")

M3.2 final | MAE: 2.500 | RMSE: 5.148

Show code

plot_learning_curve(result_m3_2["history"], "M3.2 - Hospitals with WW Data")

7.7. Export Model Results (by SW) to Outputs Folder

Show code


def per_sw_mae(result):
    int_to_sw_id = {v: k for k, v in result["sw_id_to_int"].items()}
    df = pd.DataFrame({
        "sw_idx": result["sw_idx"],
        "abs_err": np.abs(result["preds"] - result["targets"]),
        "actual": result["targets"],
    })
    out = (df.groupby("sw_idx")
           .agg(mae=("abs_err", "mean"),
                n_test_weeks=("abs_err", "size"),
                mean_actual=("actual", "mean"))
           .reset_index())
    out["sw_id"] = out["sw_idx"].map(int_to_sw_id)
    return out[["sw_id", "mae", "n_test_weeks", "mean_actual"]]


per_sw_mae(result_m1).to_csv("data/outputs/per_sw_mae_m1.csv", index=False)
per_sw_mae(result_m2).to_csv("data/outputs/per_sw_mae_m2.csv", index=False)
per_sw_mae(result_m3_1_1).to_csv("data/outputs/per_sw_mae_m3_1_1.csv", index=False)
per_sw_mae(result_m3_1_2).to_csv("data/outputs/per_sw_mae_m3_1_2.csv", index=False)
per_sw_mae(result_m3_2  ).to_csv("data/outputs/per_sw_mae_m3_2.csv",   index=False)

7.8. Input Back into R

Show code

results <- list(
  m1     = read.csv("data/outputs/per_sw_mae_m1.csv"),
  m2     = read.csv("data/outputs/per_sw_mae_m2.csv"),
  m3_1_1 = read.csv("data/outputs/per_sw_mae_m3_1_1.csv"),
  m3_1_2 = read.csv("data/outputs/per_sw_mae_m3_1_2.csv"),
  m3_2   = read.csv("data/outputs/per_sw_mae_m3_2.csv")
)

Part 8. Evaluation

8.1. Final Model Comparison

Show code

comparison = pd.DataFrame([
    {"Model": "M1",     "Window": "2020-09 → 2021-09", "Sewersheds": 108,
     "Features": "hosp-only",         "MAE": result_m1["mae"],     "RMSE": result_m1["rmse"]},
    {"Model": "M2",     "Window": "2020-09 → 2021-09", "Sewersheds": 108,
     "Features": "hosp-only (tuned)", "MAE": result_m2["mae"],     "RMSE": result_m2["rmse"]},
    {"Model": "M3.1.1", "Window": "2023-01 → 2024-12", "Sewersheds": 108,
     "Features": "hosp-only",         "MAE": result_m3_1_1["mae"], "RMSE": result_m3_1_1["rmse"]},
    {"Model": "M3.1.2", "Window": "2023-01 → 2024-12", "Sewersheds": 69,
     "Features": "hosp-only",         "MAE": result_m3_1_2["mae"], "RMSE": result_m3_1_2["rmse"]},
    {"Model": "M3.2",   "Window": "2023-01 → 2024-12", "Sewersheds": 69,
     "Features": "hosp + WW",         "MAE": result_m3_2["mae"],   "RMSE": result_m3_2["rmse"]},
])

comparison["MAE"]  = comparison["MAE"].round(3)
comparison["RMSE"] = comparison["RMSE"].round(3)

scrollable(comparison)

Model	Window	Sewersheds	Features	MAE	RMSE
M1	2020-09 → 2021-09	108	hosp-only	4.700	9.945
M2	2020-09 → 2021-09	108	hosp-only (tuned)	4.171	8.319
M3.1.1	2023-01 → 2024-12	108	hosp-only	3.021	6.062
M3.1.2	2023-01 → 2024-12	69	hosp-only	2.496	5.132
M3.2	2023-01 → 2024-12	69	hosp + WW	2.500	5.148

Research Question 1.

1a. M1, the untuned model, was set with a sequence window of 12 weeks, a learning rate of 0.001, and a hidden size of 32. M2, the sweep winner, used a sequence length of 8, a learning rate of 0.01, and a hidden size of 32 — same architecture, just smaller window and faster learning rate. M1’s typical prediction was off by ~4.7 admissions; M2 reduced that to ~4.2.

The hyperparameter sweep did not produce a singular obvious winner — MAEs across configurations clustered closely, and the ranking was sensitive to seed-to-seed variation (seed 508 vs seed 408 produced different sweep winners). However, the 12-week sequence configurations were consistently among the worst-performing across both seeds, likely because every additional week of lookback also reduces the number of training samples available — costly when the early-window dataset already has only ~52 weeks per sewershed.

1b.

The greatest model improvement was seen comparing M2 to M3.1.1 — both use the same 108-sewershed eligibility set and the same hyperparameters, but M3.1.1 trains on roughly twice as many weeks of data (the late-window 2023-2024 period vs M2’s early-window 2020-2021). M3.1.1’s average prediction error was ~3 admissions, compared to M2’s ~4.2.

Research Question 2.

The addition of wastewater surveillance data in M3.2 does not appear to have improved predictive power compared to M3.1.2. Trained on the same sewersheds, with the same hyperparameters, in the same date range, both models produced nearly identical MAEs (~2.5 admissions/week). Seed-to-seed comparison (seed 508 vs 408) reinforces this null finding: under the alternate seed, M3.2 was nominally better than M3.1.2 by a comparably small margin in the opposite direction. The difference between including and excluding wastewater is well within the noise floor produced by training stochasticity alone.

8.2. Per-sewershed error maps

8.2.1. Comparison Between M3.1.1 (108 sewersheds) vs M3.1.2 (69 sewersheds)

Show code

mae_compare <- bind_rows(
  results$m3_1_1 |> mutate(model = "M3.1.1: 108 sw, hosp-only"),
  results$m3_1_2 |> mutate(model = "M3.1.2: 69 sw, hosp-only")
)

# Cross polygons with both model labels so M3.1.2's missing sewersheds show as grey
map_sewer_compare <- map_lu_sewer |>
  select(sw_id) |>
  tidyr::crossing(model = unique(mae_compare$model)) |>
  left_join(mae_compare, by = c("sw_id", "model")) |>
  st_as_sf()


ggplot(map_sewer_compare) +
  geom_sf(data = map_ny_counties, fill = "white", color = "grey80",
          inherit.aes = FALSE) +
  geom_sf(aes(fill = mae), color = "white", linewidth = 0.1) +
  facet_wrap(~ model, ncol = 2) +
  scale_fill_viridis_c(option = "magma", direction = -1,
                       name = "MAE\n(admissions)",
                       na.value = "grey85") +
  labs(title    = "Per-sewershed forecast error: full vs WW-eligible subset",
       subtitle = "Grey = sewershed excluded from M3.1.2 (no WW coverage ≥ 70%)") +
  theme_void() +
  theme(strip.text = element_text(size = 11))

This map confirms that the 39 sewersheds excluded from the WW-eligible subset cluster in Long Island, parts of NYC, and the Buffalo area. Some of these missing sewersheds appear to have some of the highest error rates in the M3.1.1 model - which aligns with the observation from the full model comparison earlier - some of the sewersheds that were dropped for wastewater surveillance coverage in the analysis are the most difficult sewersheds to predict.

8.2.2. Error Map of Final WWTP Model

Show code

# Per-sewershed delta: positive = WW hurt, negative = WW helped
delta <- results$m3_1_2 |>
  select(sw_id, mae_hosp_only = mae) |>
  inner_join(
    results$m3_2 |> select(sw_id, mae_with_ww = mae),
    by = "sw_id"
  ) |>
  mutate(delta_mae = mae_with_ww - mae_hosp_only)

# Quick sanity print before mapping — useful for the video narration
print(summary(delta$delta_mae))

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-5.277017 -0.064053 -0.002640  0.004293  0.058099  5.193978

Show code

print(paste("WW helped:", sum(delta$delta_mae < 0), "sewersheds"))

[1] "WW helped: 35 sewersheds"

Show code

print(paste("WW hurt:",   sum(delta$delta_mae > 0), "sewersheds"))

[1] "WW hurt: 34 sewersheds"

Show code

# Join to polygons
map_delta <- map_lu_sewer |>
  inner_join(delta, by = "sw_id")

# Diverging scale, centered at zero
delta_max <- max(abs(map_delta$delta_mae), na.rm = TRUE)

ggplot(map_delta) +
  geom_sf(data = map_ny_counties, fill = "white", color = "grey80",
          inherit.aes = FALSE) +
  geom_sf(aes(fill = delta_mae), color = "white", linewidth = 0.1) +
  scale_fill_gradient2(
    low = "#1b7837", mid = "grey95", high = "#762a83",
    midpoint = 0, limits = c(-delta_max, delta_max),
    name = "Δ MAE\n(WW − hosp-only)"
  ) +
  labs(
    title    = "Marginal effect of wastewater signal: M3.2 − M3.1.2",
    subtitle = "Green = WW improved forecast; purple = WW hurt; grey = no change",
    caption  = "Same 69 sewersheds, same hyperparameters, only feature set differs"
  ) +
  theme_void()

Most of New York State sits at or near zero on this map — adding wastewater data shifted per-sewershed forecasts by less than half an admission per week in the vast majority of sewersheds. A handful of sewersheds show larger swings (~5 admissions), with one large positive shift near NYC (WW worsened the forecast there) and one large negative shift in central NY (WW improved it).

However, comparing this map to the equivalent map produced under a different seed (seed 408) reveals that these per-sewershed shifts are not stable: individual sewersheds change color and magnitude substantially between runs, while the pooled MAE remains essentially unchanged. The cleanest interpretation of these changing results is that the per-sewershed effects reflect optimization variability rather than a stable wastewater effect — adding WW surveillance does not consistently help or hurt forecasts at any specific location.

Combined with the near-zero pooled difference in 8.1, the spatial story confirms the headline finding: wastewater surveillance data does not measurably improve sewershed-level COVID-19 hospitalization forecasting beyond what hospital occupancy features already provide in these models.