Final Project Data Preparation

AGOG 508: GeoAI For SmartCities

Author

Alex Mossawir

Research Questions

All models forecast new COVID-19 hospital admissions two weeks ahead at the week + sewershed level. Train/test split is 80/20 chronological — the most recent 20% of weeks in each model’s window are held out for evaluation.

Question 1. How much historical covid hospitalization data is needed to train a model with good covid hospital admissions prediction performance at the sewershed level?

Part 1a compares M1 and M2 within the short, early-pandemic window (September 2020 – September 2021), when admissions volumes were high and signal was strong. The goal is to identify the best training parameters via hyperparameter sweep.

  • M1 is the baseline: a pooled LSTM trained on hospital admissions data, with input_size = 4 (new admissions, current patients, ICU patients, and a missing mask to identify forward-filled weeks) and new admissions as the prediction target. Default hyperparameters.

  • M2 is the winner of a hyperparameter sweep over the following grid, trained on the same sewersheds and date range as M1:

    • Sequence length: 4, 8, or 12 weeks
    • Learning rate: 0.01, 0.001, 0.0001
    • Hidden size: 16, 32, 64

Part 1b compares M2 (early window, 108 sewersheds) against M3.1.1 (late window, January 2023 – December 2024, same 108 sewersheds, M2’s chosen hyperparameters transferred without re-optimization). This isolates the effect of data window — does training on a longer, later, more stably-instrumented period improve forecasting?

Question 2. Does SARS-CoV-2 wastewater surveillance data improve sewershed-level COVID hospitalization forecasting beyond what hospital occupancy features already provide?

This question is addressed by comparing M3.1.2 and M3.2 on a matched set of 69 sewersheds — those with both ≥1 reporting hospital and ≥70% wastewater coverage during the late window.

  • M3.1.2 uses the same hospital-only feature set as M3.1.1 (input_size = 4), restricted to the 69 wastewater-eligible sewersheds. This restriction makes the comparison against M3.2 apples-to-apples.

  • M3.2 adds two wastewater channel — vp_conc_log_mean, the log-transformed mean SARS-CoV-2 concentration per sewershed-week, and an informational missing mask to indicate what was forward-filled — bringing input_size = 6. All other architecture and hyperparameters match M3.1.2.

The contrast between M3.1.2 and M3.2 isolates the marginal contribution of wastewater signal to forecasting accuracy.

Data Sources:

Hospital Electronic Response Data System (HERDS) Hospital Survey: COVID-19 Hospitalizations and Beds https://health.data.ny.gov/Health/New-York-State-Statewide-COVID-19-Hospitalizations/jw46-jpb7/about_data

New York State Statewide COVID-19 Wastewater Surveillance Data https://health.data.ny.gov/Health/New-York-State-Statewide-COVID-19-Wastewater-Surve/hdxs-icuh/about_data

New York State Sewershed Polygons github.com/nys-wwsn/ny-sewer-spatial-data

Latitude and Longitudes of Hospitals to Spatially Join to Sewershed Data https://health.data.ny.gov/Health/Health-Facility-General-Information/vn5v-hh5r/about_data

Part 0: Data Loading and Setup

0.1. Load Required Packages

  • Required packages for data cleaning: tidyverse
  • Required packages for presentation: knitr, kableExtra
  • Required packages for data retrieval: vroom

0.2 Import Run Mapping Information

Show code
## Date Ranges
window_start <- as.Date("2020-03-26")  # earliest hosp date
window_end   <- as.Date("2025-10-06")  # latest hosp date


#   Import Project Lookup
#--------------------------------------------------------------------------------------------------

lu_run <- read_excel("dictionaries/508_Final Project_Wastewater Monitoring_Lookup_AMossawir.xlsx")
##Print LU for Reference
lu_run %>%
kable(
    caption = "Project Lookup"
  ) 
Project Lookup
run_id lu_year lu_url lu_dict limit
hosp hosp https://health.data.ny.gov/api/views/jw46-jpb7/rows.csv?accessType=DOWNLOAD dict_hosp NA
waste wwtp https://health.data.ny.gov/api/views/hdxs-icuh/rows.csv?accessType=DOWNLOAD dict_wwtp NA

0.3. Generate Necessary Fields for Downloading CSV

New Columns: - Cache: Location of cached data - Cache_ts: Time that CSV was originally pulled from OData source

Show code
#   Cache Info and Location
#--------------------------------------------------------------------------------------------------
#     Workflow will check project file for cached data, to avoid re-downloading every time this
#     gets knit.
#--------------------------------------------------------------------------------------------------
## File cached data saved to

lu_run$cache <- file.path(paste0("data/raw/",lu_run$run_id,".rds"))

## File Timestamp saved to
lu_run$cache_ts <- file.path(paste0("data/raw/",lu_run$run_id,"_ts.txt"))

0.4. Import Data

0.4.1. Helper Function - import_fetch or load

This function checks if the document specified in the run mapping is currently cached within the project folder. If it is, it loads the data into the LU directly. If it is not, it pulls the number of rows specified in the document (or all) from the url in the lookup, caches the new file, and loads into the LU.

This function will be called on an entire row of the LU at a time - The following columns will always need to be in the LU for this to work: - cache - relative file path + name of .rds for each dataset imported - cache_ts - relative file path + name of .txt file with time that the url was queried - lu_url - odata URL where data is being pulled from

Show code
import_fetchorload <- function(LU){ 
  
  ## Check if the cached data exists 
if(file.exists(LU$cache[[1]])) {
  ## If so - print the following information:
  print(
  paste0(
    "Loading data from ", LU$cache[[1]],
    " (saved at ", read_lines(LU$cache_ts[[1]])[[1]], ")",
    " - saving to df."
        )
      )
  ## Read in data from cached location and ts
  LU$data [[1]]<- readRDS(LU$cache[[1]])
  LU$data_timestamp <- read_lines(LU$cache_ts[[1]])
  return(LU)
  
} else {
  ##Print the message
    print(
      paste0(
        "No data saved at ", as.character(LU$cache[[1]]), 
        "- loading from url: ",as.character(LU$lu_url[[1]])
          )
        )
  if (is.na(LU$limit[[1]])==TRUE) {
      temp <- as.data.frame(vroom(as.character(LU$lu_url[[1]])))
  } else {
      temp <- as.data.frame(vroom(as.character(LU$lu_url[[1]]), n_max = LU$limit[[1]]))}
  ts <- strftime(Sys.time())
  cat(ts, file = LU$cache_ts[[1]])
  saveRDS(temp,LU$cache[[1]])
  LU$data_timestamp <- ts
  LU$data[[1]] <- temp
  return(LU)
}}

0.4.2. Loop through LU to import data

Final output will be a series of messages in console indicating whether it needed to query the URL or whether it was able to just pull from the DF directly. Data frames will be stored within the LU table that we’ve been operating on.

Show code
#   Run through entire lookup, execute function
#--------------------------------------------------------------------------------------------------
#  
#--------------------------------------------------------------------------------------------------

lu_run <- rowwise(lu_run) %>% 
  mutate(import_fetchorload(cur_data()))
[1] "Loading data from data/raw/hosp.rds (saved at 2026-04-28 09:36:59) - saving to df."
[1] "Loading data from data/raw/waste.rds (saved at 2026-04-28 09:37:50) - saving to df."

0.5. Import all Dictionaries

Show code
#   Import dictionaries
#--------------------------------------------------------------------------------------------------
#   This can be used to add additional years or remove years more quickly. building this process 
#   on a subset of 20 rows from 2016 and 2017, and then revising to add in the full year range 
#   and the full datasets once workflows are clean, so it can be batched.
#   
#   This is designed so that each dictionary is sourced from a different tab in an excel document.
#   
#--------------------------------------------------------------------------------------------------

#--------------------------------------------------------------------------------------------------
## Dictionary Path:
dict <- "dictionaries/508_Final Project_dictionaries_AMossawir.xlsx"
#--------------------------------------------------------------------------------------------------
 
##pull years out of the run lu to loop over

year <- as.character(lu_run$lu_dict)
##create a dictionary list of sheets read from excel, looping over years
dicts <- lapply(year, function(row){
  cat(paste0("\nLoading dictionary from:", as.character(dict), "\nRead sheet: ", as.character(row)))
  read_excel(dict,sheet=row)

})

Loading dictionary from:dictionaries/508_Final Project_dictionaries_AMossawir.xlsx
Read sheet: dict_hosp
Loading dictionary from:dictionaries/508_Final Project_dictionaries_AMossawir.xlsx
Read sheet: dict_wwtp
Show code
## set the names of the objects using years
dicts <- setNames(dicts,year)

rm(year, dict)

Part 1. Data Cleaning and Processing

1.1. Column Rename and Type Classification

1.1.1. Helper Function - standardize_columns

Show code
#   Standardize Columns Function
#-------------------------------------------------------------------------------------------------
#   This is a function that standardizes the columns based on the provided dictionaries -
#   If this fails, it means that there is something that does not match in the dictionary.
#   
#   This is by design - I want it to be able to account for every single row, if something is
#   mapped incorrectly in the dataset, then it neeeds a new dictionary.
#
#   For example, while making this, I copy/pasted the rows from a print statement, but I
#   dropped one of the underscores on "__id" for the 2017, but not for the 2016 data. 
#
#--------------------------------------------------------------------------------------------------
 
standardize_columns <- function(run_lu, dictionaries){ 
    w_data <- run_lu$data[[1]]
    w_dict <- dictionaries[[ run_lu$lu_dict[[1]] ]]  
    
    w_names <- setNames(w_dict$api_field, w_dict$final_name)
    
    temp <- w_data %>%
        rename(!!!w_names) %>%
        mutate(
            across(
                all_of(w_dict$final_name[w_dict$final_type == "character"]),
                as.character
            ),
            across(
                all_of(w_dict$final_name[w_dict$final_type == "numeric"]),
                as.numeric
            ),
            across(
               all_of(w_dict$final_name[w_dict$final_type == "date"]),
                lubridate::mdy
               #~ as_date(substr(.x, 1, 10))
            ),
            across(
                all_of(w_dict$final_name[w_dict$final_type == "logical"]),
                as.logical
            )
        )
    run_lu$a_data[[1]] <- temp
    
    return(run_lu)
}

1.1.2. Loop through datasets to standardize

Show code
#   Run through entire lookup, execute function
#--------------------------------------------------------------------------------------------------
#   If this fails, check the console for issues with the LUs not matching.
#--------------------------------------------------------------------------------------------------    
lu_run <- rowwise(lu_run) %>% 
  mutate(standardize_columns(cur_data(),dicts))

Hospitals

v_date v_pfi v_facility region county network nyforwardregion v_currentpatients admissions admissions_noncov admissions_new admissions_positive patients_discharged v_patients_icu patients_icu_intubation patients_expired discharges_cum fatalities_cum beds beds_available beds_icu beds_icu_available beds_acute beds_acute_occupied beds_icu_1 beds_icu_occupied v_admissions_totalnew patients_age_u1 patients_age_1to4 patients_age_5to19 patients_age_20to44 patients_age_45to54 patients_age_55to64 patients_age_65to74 patients_age_75to84 patients_age_85plus hospitalized_indicator
2025-10-06 000727 OSWEGO HOSPITAL CENTRAL NEW YORK REGIONAL OFFICE OSWEGO INDEPENDENT CENTRAL NEW YORK 1 1 0 0 0 0 0 0 0 1101 82 0 0 0 0 55 42 8 6 0 0 0 0 0 0 0 0 0 0 0
2025-10-06 000812 GOUVERNEUR HOSPITAL CENTRAL NEW YORK REGIONAL OFFICE ST. LAWRENCE ST. LAWRENCE HEALTH SYSTEM NORTH COUNTRY 0 0 0 0 0 0 0 0 0 134 3 0 0 0 0 25 12 0 0 0 0 0 0 0 0 0 0 0 0 0
2025-10-06 000412 THE UNITY HOSPITAL OF ROCHESTER - ST. MARY'S CAMPUS WESTERN REGIONAL OFFICE MONROE ROCHESTER REGIONAL HEALTH SYSTEM FINGER LAKES 0 0 0 0 0 0 0 0 0 73 0 0 0 0 0 32 0 0 0 0 0 0 0 0 0 0 0 0 0 0

WWTPs

county zip v_totalpop v_samplelocation samplelocation_long v_epaid v_wwtp CapacityMGD StormwaterInput SampleType v_samplematrix Pretreatment SolidSeparation ConcentrationMethod ExtractionMethod RecEffTargetName RecEffSpikeMatrix RecEffSpikeConc PCRTarget PCRGeneTarget PCRGeneTargetRef PCRType LodRef HumFracTargetMic HumFracTargetMicRef QuantStanType StanRef InhibitionMethod NumNoTargetControl v_date v_flowrate SampleID LabID v_testdate v_conc flag_belowlod v_lodsewage flag_ntcamplify v_receffpct InhibitionDetect InhibitionAdjust v_humfracmicconc flag_qc
Albany 12158 31023 wwtp NA NY0025739 Town of Bethlehem - Cedar Hill WPCP 6 yes 24-hr time-weighted composite raw wastewater no centrifugation ultracentrifugation zymo quick-rna fungal/bacterial miniprep #r2014 bcov vaccine raw sample 1e+06 sars-cov-2 ip2 and ip4 combined Chavarria-Miró G, Anfruns-Estrada E, Martínez-Velázquez A, Vázquez-Portero M, Guix S, Paraira M, Galofré B, Sánchez G, Pintó RM, Bosch A. Time Evolution of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) in Wastewater during the First Pandemic Wave of COVID-19 in the Metropolitan Area of Barcelona, Spain. Appl Environ Microbiol. 2021 Mar 11;87(7):e02750-20. doi: 10.1128/AEM.02750-20. PMID: 33483313; PMCID: PMC8091622. qPCR Corman VM, Landt O, Kaiser M, Molenkamp R, Meijer A, Chu DK, Bleicker T, Brünink S, Schneider J, Schmidt ML, Mulders DG. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Eurosurveillance. 2020 Jan 23;25(3):2000045. Wilder ML, Middleton F, Larsen DA, Du Q, Fenty A, Zeng T, Insaf T, Kilaru P, Collins M, Kmush B, Green HC. Co-quantification of crAssphage increases confidence in wastewater-based epidemiology for SARS-CoV-2 in low prevalence areas. Water research X. 2021 May 1;11:100100. crassphage Stachler E, Kelty C, Sivaganesan M, Li X, Bibby K, Shanks OC. Quantitative CrAssphage PCR assays for human fecal pollution measurement. Environmental science & technology. 2017 Aug 15;51(16):9146-54. DNA Wilder ML, Middleton F, Larsen DA, Du Q, Fenty A, Zeng T, Insaf T, Kilaru P, Collins M, Kmush B, Green HC. Co-quantification of crAssphage increases confidence in wastewater-based epidemiology for SARS-CoV-2 in low prevalence areas. Water research X. 2021 May 1;11:100100. Green HC, Field KG. Sensitive detection of sample interference in environmental qPCR. Water research. 2012 Jun 15;46(10):3251-60. 3 2024-10-15 2.45 20241015NY001025739A Quadrant 2024-10-16 0 NA 800 FALSE 0.94 no no 59200000 no
Albany 12158 31023 wwtp NA NY0025739 Town of Bethlehem - Cedar Hill WPCP 6 yes 24-hr time-weighted composite raw wastewater no centrifugation ultracentrifugation zymo quick-rna fungal/bacterial miniprep #r2014 bcov vaccine raw sample 1e+06 sars-cov-2 ip2 and ip4 combined Chavarria-Miró G, Anfruns-Estrada E, Martínez-Velázquez A, Vázquez-Portero M, Guix S, Paraira M, Galofré B, Sánchez G, Pintó RM, Bosch A. Time Evolution of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) in Wastewater during the First Pandemic Wave of COVID-19 in the Metropolitan Area of Barcelona, Spain. Appl Environ Microbiol. 2021 Mar 11;87(7):e02750-20. doi: 10.1128/AEM.02750-20. PMID: 33483313; PMCID: PMC8091622. qPCR Corman VM, Landt O, Kaiser M, Molenkamp R, Meijer A, Chu DK, Bleicker T, Brünink S, Schneider J, Schmidt ML, Mulders DG. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Eurosurveillance. 2020 Jan 23;25(3):2000045. Wilder ML, Middleton F, Larsen DA, Du Q, Fenty A, Zeng T, Insaf T, Kilaru P, Collins M, Kmush B, Green HC. Co-quantification of crAssphage increases confidence in wastewater-based epidemiology for SARS-CoV-2 in low prevalence areas. Water research X. 2021 May 1;11:100100. crassphage Stachler E, Kelty C, Sivaganesan M, Li X, Bibby K, Shanks OC. Quantitative CrAssphage PCR assays for human fecal pollution measurement. Environmental science & technology. 2017 Aug 15;51(16):9146-54. DNA Wilder ML, Middleton F, Larsen DA, Du Q, Fenty A, Zeng T, Insaf T, Kilaru P, Collins M, Kmush B, Green HC. Co-quantification of crAssphage increases confidence in wastewater-based epidemiology for SARS-CoV-2 in low prevalence areas. Water research X. 2021 May 1;11:100100. Green HC, Field KG. Sensitive detection of sample interference in environmental qPCR. Water research. 2012 Jun 15;46(10):3251-60. 3 2023-01-24 5.33 20230124NY001025739A Quadrant 2023-01-25 101000 NA 800 FALSE 3.45 no no 45800000 no
Albany 12204 109426 wwtp NA NY0026875 Albany County WPD - North Plant 35 yes 24-hr time-weighted composite raw wastewater no centrifugation ultracentrifugation zymo quick-rna fungal/bacterial miniprep #r2014 bcov vaccine raw sample 1e+06 sars-cov-2 ip2 and ip4 combined Chavarria-Miró G, Anfruns-Estrada E, Martínez-Velázquez A, Vázquez-Portero M, Guix S, Paraira M, Galofré B, Sánchez G, Pintó RM, Bosch A. Time Evolution of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) in Wastewater during the First Pandemic Wave of COVID-19 in the Metropolitan Area of Barcelona, Spain. Appl Environ Microbiol. 2021 Mar 11;87(7):e02750-20. doi: 10.1128/AEM.02750-20. PMID: 33483313; PMCID: PMC8091622. qPCR Corman VM, Landt O, Kaiser M, Molenkamp R, Meijer A, Chu DK, Bleicker T, Brünink S, Schneider J, Schmidt ML, Mulders DG. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR. Eurosurveillance. 2020 Jan 23;25(3):2000045. Wilder ML, Middleton F, Larsen DA, Du Q, Fenty A, Zeng T, Insaf T, Kilaru P, Collins M, Kmush B, Green HC. Co-quantification of crAssphage increases confidence in wastewater-based epidemiology for SARS-CoV-2 in low prevalence areas. Water research X. 2021 May 1;11:100100. crassphage Stachler E, Kelty C, Sivaganesan M, Li X, Bibby K, Shanks OC. Quantitative CrAssphage PCR assays for human fecal pollution measurement. Environmental science & technology. 2017 Aug 15;51(16):9146-54. DNA Wilder ML, Middleton F, Larsen DA, Du Q, Fenty A, Zeng T, Insaf T, Kilaru P, Collins M, Kmush B, Green HC. Co-quantification of crAssphage increases confidence in wastewater-based epidemiology for SARS-CoV-2 in low prevalence areas. Water research X. 2021 May 1;11:100100. Green HC, Field KG. Sensitive detection of sample interference in environmental qPCR. Water research. 2012 Jun 15;46(10):3251-60. 3 2023-06-05 20.00 20230605NY001026875A Quadrant 2023-06-06 0 NA 800 FALSE 14.51 no no 8050000 no

1.2. Relevant Columns Only

Show code
prefixes <- c("v_", "flag_")
colfilter <- function(w_data, prefix_list) {
  w_data <- w_data |>
    select(starts_with(prefix_list))
}

lu_run$a_data <- map(lu_run$a_data, ~ colfilter(.x, prefixes))

rm(prefixes)

Part 2: Data Subsetting on Exclusion Criteria

2.1 Data Quality Rule Import

Based on the contents of the data dictionaries and the key variables being used for the analysis (permanent facility id), I have created a rule dictionary with instructions to be placed into the rule engine.

This first pass is to run through the basic data quality filters - not study design filters. Any observations that cannot be used for this analysis (e.g. enhanced de-identification redacting facility information, missing key case-mix identifier like age group) will be filtered through this stage. The engine will count the number of rows in each run’s dataset and log it, as it passes the data onto the next rule.

All rule dictionaries will have the same structure, so they can be fed through the same update/logging engine downstream, this pass uses the lu filters rule dictionary.

Dictionaries have the following structure:

  • stage: This indicates which rules should be applied - I am likely keeping these dictionaries in separate tabs, but I wanted to make this filter in case I stack it later.
  • rule_id: Unique identifier for the rules for QA/querying
  • rule_name: Unique legible identifier for QA
  • rule condition: the logic that will be used to identify which rows the rules are applied to.
  • rule_outcome: For filtering, all rules apply exclude, but downstream and for sensitivity analysis, the rule engine may be used for partitioning the dataset or other outcomes - this will be used to determine that process.
  • applies: List of which runs to apply the rule to - for example, the 2016 and 2017 datasets have the abortion_edit_indicator for enhanced de-identification, while the 2018 and 2019 datasets do not have this indicator, and identifying redactions must be done by searching for “Redacted for Confidentiality” within facility_name.
  • order: Rules are applied sequentially based on this column -logged_as This is the clean string used to flag what the transformation was for the log table, to be formatted for final tables and outputs.
Show code
#   Import LU
#--------------------------------------------------------------------------------------------------
#   Set the stage here with the variable stage - this step is currently the data_quality pass, so
#   it's going to pull in the rules flagged with "data_quality" - in this case, from tab name.
#   Store it to data quality, again. If we want to import multiple LUs  - add tab names to 
#   the list.
#--------------------------------------------------------------------------------------------------

#--------------------------------------------------------------------------------------------------
## Dictionary Path:
dict <- "dictionaries/508_Final Project_filters_AMossawir.xlsx"
stage <- c("data_quality")
#--------------------------------------------------------------------------------------------------
 
##create a dictionary list of sheets read from excel, loop over stages, and add them to dicts.
temp_dicts <- lapply(stage, function(row){
  cat(paste0("\nLoading dictionary from:", as.character(dict), "\nRead sheet: ", as.character(row)))
  read_excel(dict)
})

Loading dictionary from:dictionaries/508_Final Project_filters_AMossawir.xlsx
Read sheet: data_quality
Show code
temp_dicts <- setNames(temp_dicts, paste0("lu_", stage))

dicts[names(temp_dicts)] <- NULL
dicts <- c(dicts, temp_dicts)
## drop temp table
rm(temp_dicts, dict)

##Print LU for Reference
dicts$lu_data_quality %>%
kable(
    caption = "Rules Lookup for Filtering"
  ) 
Rules Lookup for Filtering
stage rule_id rule_name rule_condition rule_outcome output_prefix applies order logged_as
data_quality 1 missing_pfi is.na(v_pfi) exclude a hosp 1 Missing Facility ID
data_quality 2 missing_outcome is.na(v_admissions_totalnew) exclude a hosp 2 Missing Admissions
data_quality 3 outside_window v_date < window_start | v_date > window_end exclude a hosp 3 Outside Analysis Window
data_quality 4 missing_epaid is.na(v_epaid) exclude a waste 1 Missing Facility ID
data_quality 5 wwtp_only v_samplelocation != “wwtp” exclude a waste 2 Exclude non-WWTP samples
data_quality 6 influent_only v_samplematrix != “raw wastewater” exclude a waste 3 Exclude non-Influent samples
data_quality 7 qc_failed flag_qc != “no” exclude a waste 4 Sample failed quality control
data_quality 8 no_template_control_amplify flag_ntcamplify == TRUE exclude a waste 5 Sample failed quality control
data_quality 9 missing_concentration is.na(v_conc) & (is.na(flag_belowlod) | flag_belowlod == FALSE) exclude a waste 6 Missing Concentration
data_quality 10 outside_window v_date < window_start | v_date > window_end exclude a waste 7 Outside Analysis Window

2.2. Data Quality Rules Pass - Helper Functions

The structure of the engine will first loop through over run_id using lu_run, this is the exterior loop. For each run, the engine will first determine which rules apply applies.

Then it will loop over the rules in an interior loop. For each rule, it will log iniitial nrow, and save to audit log. Then it will either apply the rule filtering or skip over it. It will output final nrows to the audit log, and pass the revised data to the next step of the interior loop. It proceeds sequentially over the rows, so working data will be updated once the rule has been executed and logged.

The inner loop passes the final audit lists and the final w_data back to the outer loop. The outer loop saves the audit log and the data to columns in the lookup.

2.2.1. Cached Locations

Show code
#   Cache Info and Location
#--------------------------------------------------------------------------------------------------
#     Workflow will check project file for cached data - this is write-only, just to keep records.
#--------------------------------------------------------------------------------------------------
## File cached data saved to

lu_run[[paste0("cache_", stage)]]<- file.path(paste0("data/",stage,"/",lu_run$run_id,".rds"))
print(lu_run[[paste0("cache_", stage)]])
[1] "data/data_quality/hosp.rds"  "data/data_quality/waste.rds"
Show code
## File Timestamp saved to
lu_run[[paste0("cache_ts_", stage)]]<- file.path(paste0("data/",stage,"/",lu_run$run_id,"_ts.txt"))
print(lu_run[[paste0("cache_ts_", stage)]])
[1] "data/data_quality/hosp_ts.txt"  "data/data_quality/waste_ts.txt"

2.2.2. flag_rule_application

This function is called at a run_id level within the outer loop. Once the working lookup is selected, it loops over the rule list to determine if the rule is applicable to this particular run. i.e: a rule with the applies text “puf_2016, puf_2017” will be split into a vector and evaluated - it will only be applied when the run_id is in that vector.

Show code
#Function to Determine Rule Applicability - Within Outer Loop
#--------------------------------------------------------------------------------------------------
#   Within outer loop, looking at the current run_id that is being passed, 
#    and determine which rules should be applied.
#--------------------------------------------------------------------------------------------------

flag_rule_application <- function(rule, run_id) {
  applies_vec <- rule$applies_rule
  if ("all" %in% applies_vec) { TRUE } else if (run_id %in% applies_vec) { TRUE } else { FALSE }
}

2.2.3. apply_exclude

This function is called at a rule level within the inner loop. Because the lookup stores the logic for what columns should be excluded as a string, it needs to be unquoted to be properly evaluated as logic. This applies that logic with a filter function from dplyr in order to exclude anything that fits that criteria.

Show code
#Function Within Inner Loop - Actually Exclude
#--------------------------------------------------------------------------------------------------
#   Applies filter on the dataset using rule expression - called within the inner loop
#--------------------------------------------------------------------------------------------------

##function to if it excludes, does it
apply_exclude <- function(w_data, rule_expr) {
  w_data %>% filter(!(!!rule_expr))
}

2.2.4. apply_rules

This function is called on each specific rule list within the inner loop. It takes a single rule from the inner loop and applies it to the w_data working data set. It also generates a list of values to output to an audit table so all transformations (or skipped rules) can be tracked. This audit row lists the initial number of rows, and the final number of rows after transformation, and reports the number of observations affected by each rule.

This is the core of the inner loop. It returns transformed w_data and a row of the audit table.

Show code
#--------------------------------------------------------------------------------------------------
#   Function to Apply Rules - Within Inner Loop
#--------------------------------------------------------------------------------------------------
#   This takes the working dictionary and working data from the interior loop, and continues,
#   passing run_id from the outer loop. It always references [[1]] because it will be passed
#   rowwise. Outputs the updated dataset and a new list, audit_log.
#--------------------------------------------------------------------------------------------------
apply_rule <- function(rule, w_data, run_id) {
##Prelim Information  
  run_flag <- rule$flag_rule_applies
#Generate Audit Row
audit_row <- list(
  stage         = rule$stage,
  run_id        = run_id,
  rule_id       = rule$rule_id,
  rule_name     = rule$rule_name,
  rule_condition= rule$rule_condition,
  rule_outcome  = rule$rule_outcome,
  n_initial     = nrow(w_data),
  logged_as     = rule$logged_as,
  skipped       = !run_flag,
  applied       = run_flag
)
##     cat ( paste0("Run Id: ", string(run_id),   " Rule Name ",rule$rule_name), " Run flag: ", run_flag, " Class W_Data: ",  class(w_data))
## Apply dplyr rules based on what rules text is - this can be updated when I add others. But.
if (run_flag == FALSE) {
  # do nothing
} else if (rule$rule_outcome == "exclude") {
  rule_expr <- parse_expr(rule$rule_condition)
  w_data <- apply_exclude(w_data, rule_expr)
}
audit_row$n_final  <- nrow(w_data)
audit_row$n_change <- audit_row$n_initial - audit_row$n_final

audit_row$timestamp <- Sys.time()
  
   return(list(w_data = w_data, 
              audit_row = audit_row))
         
} 

2.2.5. outer_pass

This function is the core of the outer loop. It is passed a single row from lu_run, and uses that information to call the relevant objects and manage the inner loop.

While it only takes a row from lu_run as a direct argument, it references the following global variables to preserve clarity and naming:

  • stage: This is used to specify throughout that we are on the data quality filtering step. This allows future steps to be built off the same engine.
  • stage_prefix: This is used to specify the output column name’s prefix - in this case, it is ‘a2’
  • prev_col: This is used to specify the source column name - in this case it is ‘a_data’.
Show code
#Function Within Outer Loop - Manage Inner Loop
#--------------------------------------------------------------------------------------------------
#   This is called by the outer loop and is used to manage the inner loop. It passes both
#   transformed data and the audit log back.
#--------------------------------------------------------------------------------------------------

outer_pass <- function(lu_row) {
      
  ##Get that run_id, data, dict to keep passing through  
    run_id <- lu_row$run_id
  # the whole reason this loop wasn't working is because this a_data is apparently. for some reason. a list length 1
    ## with a tibble at that first place. 
    w_data <- lu_row[[prev_col]][[1]]
    w_dict <- dicts[[paste0("lu_",stage)]]
    
    ## Convert rules from a dataframe into list of lists
    rules <- apply(w_dict, 1, as.list)

    for (j in seq_along(rules)) {
      rule <- rules[[j]]
      ##Applies is currently a vector hand-written in excel
        ##First remove all whitespace, then split out what remains
              ## It's gonna put the list into a list of length one with a vector in it.
              #you literally just want the vector.
      rule$applies_rule <- strsplit(gsub("\\s", "",rule$applies), ",")[[1]]
      
      ##We're looping through all the rules rn
      rule$flag_rule_applies <- flag_rule_application(rule, run_id)
      ## and saving the new items back into rules   
       rules[[j]] <- rule
          }

  ## Create an empty list object so that we can store the audit as we go.
    audit <- list()
  ##i did this just counting first but seq_along actually passes info along each
    for (j in seq_along(rules)) {
    ##pull out list j from dictionary
      rule <- rules[[j]]
##Call function apply_rule
##    Apply rule input:
##        rule = one single row (but as list) of dictionary
##        w_data = working dataset
##        run_id = the same for everything in this inner loop
##    Apply rule output:
##        a list of: w_data, audit_row    
      result <- apply_rule(rule, w_data, run_id)
      ## replace w_data with the new w_data saved in result returned list
      w_data <- result$w_data
      ##store 
      audit[[j]] <- result$audit_row
  
      
      }
    list(
    data = w_data,
    audit = audit
  )

}

2.3. Execute Function Calls to Clean Data

The functions above will refer to global stage, prev_col, stage_prefix variables. This is to allow for flexibility of the pipeline. prev_col indicates where the data for this stage should be sourced from, and stage_prefix indicates what the column name of the destination should be. I will revise these away from using the global variable approach when I have more time.

Show code
#--------------------------------------------------------------------------------------------------
## Stage Variable
stage <- "data_quality"
stage_prefix <- "a2_"

## Column Data Source
prev_col <- "a_data"


#--------------------------------------------------------------------------------------------------

newcolname <- paste0(stage_prefix, "data")

#--------------------------------------------------------------------------------------------------
results <- vector("list", nrow(lu_run))

#--------------------------------------------------------------------------------------------------

for (i in seq_len(nrow(lu_run))) {
  ##Actually run the nested loops here, saving to a new vector
  results[[i]] <- outer_pass(lu_run[i, ])
  
  ## Save the outputted day in the specified new column
  lu_run[[newcolname]][[i]] <- results[[i]]$data
  ##save audit log to a table
  lu_run$audit[[i]] <- bind_rows(results[[i]]$audit)
  ## Cache this
  saveRDS(results[[i]]$data,lu_run$cache_data_quality[[i]])
  ##Make a timestamp
  ts <- strftime(Sys.time())
  
  ##save the timestamp
  cat (ts, file = lu_run$cache_ts_data_quality[[i]])
}

rm(results, i , stage, stage_prefix, newcolname, ts, prev_col)

2.4. Output final analytical sample

Analytical Sample Size
Hospitalizations Wastewater
Initial Sample Size 326,349 33,970
Missing Facility ID 0 0
Missing Admissions 0 0
Outside Analysis Window 0 0
Exclude non-WWTP samples 0 537
Exclude non-Influent samples 0 1,010
Sample failed quality control 0 4,283
Missing Concentration 0 0
Final Analytical Sample 326,349 28,140

Part 3. Recoding And Restructuring

3.1. Import Lookups

3.1.1. Hospital Address Lookup

General Facility Information data sourced from DOH at this link: https://health.data.ny.gov/Health/Health-Facility-General-Information/vn5v-hh5r/about_data

3.1.2. Sewershed Shapefiles and Metadata

Shapefiles and metadata sewershed sourced from this link: https://github.com/nys-wwsn/ny-sewer-spatial-data

Show code
##==================================================================================================
## SEWERSHED LOOKUPS
##--------------------------------------------------------------------------------------------------
## Builds three lookup tables used downstream by the recoding pipeline:
##   - lu_facilities : PFI → name, lat, long for every NYS hospital with geocoding
##   - lu_sewer      : one row per WWTP sewershed (epaid keyed) with polygon geometry  
##   - lu_pfi        : analysis-set hospitals spatially joined to their sewershed
##
## The spatial join — assigning each hospital point to the sewershed polygon it falls in — is
## the GeoAI core of the project. Wastewater concentrations are measured at the WWTP, so to pair
## a community signal with hospital admissions we need to know which sewershed each hospital
## drains to.
##==================================================================================================

##--------------------------------------------------------------------------------------------------
## 1. Hospital facility lookup (PFI → coordinates)
##--------------------------------------------------------------------------------------------------
## Source: NYS Health Facility General Information, snapshotted 2026-04-29.
## Provides authoritative lat/long for currently-operating facilities. PFIs are zero-padded to
## width 6 to match the format used in the hospitalizations data.
temp_pfi <-  vroom(file.path("dictionaries/Health_Facility_General_Information_20260429.csv"))
temp_pfi <- temp_pfi |>
  select(v_pfi =`Facility ID`,
         name = `Facility Name`,
         lat = `Facility Latitude`,
         long = `Facility Longitude`) |>
  mutate(v_pfi = formatC(v_pfi, width = 6, flag = "0", format = "d"))

##--------------------------------------------------------------------------------------------------
## 2. Sewershed metadata + polygon geometry
##--------------------------------------------------------------------------------------------------
## Two files from the nys-wwsn/ny-sewer-spatial-data GitHub repo:
##   - nys-wws-sewersheds.csv         : metadata (epaid, wwtp_name, population_served, sample 
##                                      location and type, etc.)
##   - nys-wws-sewersheds-polygons.csv: WKT polygon geometries
## Join key is implicit (cdc_id / sw_id pair); left_join() resolves it.
temp_sewer <- vroom(file.path("dictionaries/nys-wws-sewersheds.csv"))

## Polygon file ships without a CRS embedded — the repo README specifies WGS84, so we set 
## EPSG:4326 explicitly. Bbox sanity-checked against NY's footprint (xmin ≈ -79.7, ymax ≈ 45.0).
temp_shed <- st_read(file.path("dictionaries/nys-wws-sewersheds-polygons.csv"))
Reading layer `nys-wws-sewersheds-polygons' from data source 
  `C:\Users\alexm\OneDrive\Desktop\AGOG\Final Project\GeoAI Final Project\dictionaries\nys-wws-sewersheds-polygons.csv' 
  using driver `CSV'
Simple feature collection with 670 features and 3 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -79.72533 ymin: 40.49584 xmax: -72.29054 ymax: 45.01053
CRS:           NA
Show code
temp_shed = st_set_crs(temp_shed, 4326)

## Attach metadata to polygons. st_make_valid() handles the inevitable self-intersection edges
## that S2 (sf's default geometry engine on lat/lon) refuses to process.
temp_sewer <-  left_join(temp_shed, temp_sewer)
temp_sewer <- st_make_valid(temp_sewer)

##--------------------------------------------------------------------------------------------------
## 2a. Filter to WWTP-level samples and dedupe
##--------------------------------------------------------------------------------------------------
## NY's wastewater network samples at multiple points per plant: "wwtp" (influent at the
## treatment plant — the catchment-wide signal we want), and "upstream" (sub-catchment manhole
## sampling, useful but represents a smaller nested polygon). We keep only "wwtp" because our
## wastewater concentration data corresponds to plant-level measurements.
##
## After filtering there's still 2x duplication of every row from a quirk in how the metadata
## file is structured — distinct(epaid) collapses to one row per sewershed.
##
## Then a chain of geometry sanitization to deal with self-overlapping multipolygons:
##   st_make_valid()                                  - fix invalid edges
##   st_union(geometry, by_feature = TRUE)            - dissolve internal overlaps within each
##                                                      multipolygon (some sewersheds have
##                                                      MULTIPOLYGONs whose constituent parts
##                                                      overlap each other)
##   st_make_valid()                                  - re-validate (st_union can produce
##                                                      invalid output on gnarly inputs)
temp_sewer <- temp_sewer |>
  filter(sample_location == "wwtp") |>
  distinct(epaid, .keep_all = TRUE) |>
  st_make_valid() |>
  mutate(geometry = st_union(geometry, by_feature = TRUE)) |>
  st_make_valid()

## Belt-and-suspenders: a few polygons still had internal self-overlap that survived the above.
## st_buffer(0) is the classic GIS folk remedy — forces GEOS to rebuild the geometry from
## scratch, which usually resolves topology issues that st_make_valid alone won't catch. Using
## planar GEOS (sf_use_s2(FALSE)) for this op because S2 doesn't support zero-buffer; switching
## back to S2 immediately after.
sf_use_s2(FALSE)
temp_sewer <- temp_sewer |>
  st_buffer(0) |>
  st_make_valid()
sf_use_s2(TRUE)

##--------------------------------------------------------------------------------------------------
## 3. Hand-coded facility additions for closed/merged hospitals
##--------------------------------------------------------------------------------------------------
## Eight hospitals are absent from the current Health Facility General Information dataset
## because they've closed, merged, or been absorbed since the last refresh — but they appear in
## the historical hospitalizations data and represent stable community catchments. Coordinates
## sourced manually from Google Maps; addresses verified against AHA / Healthgrades records.
##
## (Seven temporary COVID-19 surge facilities — PFIs 080001–080007, including Javits Center and
## USNS Comfort — are excluded upstream as non-catchment overflow facilities, not added here.)
manual_facilities <- tribble(
  ~v_pfi,    ~name,                                                            ~lat,             ~long,              ~v_address,
  "000325",  "THE UNIVERSITY OF VERMONT HEALTH NETWORK - ALICE HYDE MEDICAL CENTER", 44.85696878568751,  -74.29148442883366,  "133 Park St, Malone, NY 12953",
  "000565",  "EASTERN NIAGARA HOSPITAL - LOCKPORT DIVISION",                         43.17766479149377,  -78.67168996117513,  "521 East Ave, Lockport, NY 14094",
  "000798",  "CLAXTON-HEPBURN MEDICAL CENTER",                                       44.69286484859764,  -75.50025832557549,  "214 King St, Ogdensburg, NY 13669",
  "001153",  "WYOMING COUNTY COMMUNITY HOSPITAL",                                    42.75436489510769,  -78.13098270352755,  "400 N Main St, Warsaw, NY 14569",
  "001439",  "MOUNT SINAI BETH ISRAEL",                                              40.73865113151168,  -73.98243423411715,  "281 First Ave, New York, NY 10003",
  "000413",  "STRONG MEMORIAL HOSPITAL",                                       43.12161,         -77.62862,          "601 Elmwood Ave, Rochester, NY 14642",
  "000583",  "MOUNT ST MARYS HOSPITAL AND HEALTH CENTER",                      43.17270,         -79.03670,          "5300 Military Rd, Lewiston, NY 14092",
  "001464",  "NEW YORK-PRESBYTERIAN HOSPITAL - COLUMBIA PRESBYTERIAN CENTER",  40.84072,         -73.94192,          "622 W 168th St, New York, NY 10032"
)

##--------------------------------------------------------------------------------------------------
## 4. Restrict to the analysis universe and project to sf
##--------------------------------------------------------------------------------------------------
## We only need facility geocoding for PFIs that actually appear in the hospitalizations data —
## ~200 of NY's ~5,000 listed facilities. Filtering early keeps the spatial join fast and 
## reduces noise in the duplicate audit.
pfis_in_data <- lu_run$a2_data[[which(lu_run$run_id == "hosp")]] |> distinct(v_pfi) |> pull(v_pfi)

## Combine the dataset's geocoded facilities with the manual additions, then filter to the
## analysis universe. Drop facilities with missing longitude (rare, but they'd break st_as_sf).
lu_pfi_complete <- bind_rows(filter(temp_pfi, !is.na(temp_pfi$long)), manual_facilities) |>
  filter(v_pfi %in% pfis_in_data)

## Promote to an sf points object in WGS84.
lu_pfi_complete <- st_as_sf(lu_pfi_complete, coords = c("long", "lat"), crs = 4326)

##--------------------------------------------------------------------------------------------------
## 5. Spatial join: assign each hospital to its sewershed
##--------------------------------------------------------------------------------------------------
## Standard point-in-polygon. left = TRUE so unmatched hospitals (rural facilities outside any
## monitored sewershed — typically on septic systems, not connected to municipal sewer) survive
## the join with NA sewershed fields rather than being silently dropped. They become a 
## limitations-section bullet rather than vanishing.
lu_pfi_complete<- st_join(
  lu_pfi_complete,
  temp_sewer,
  join = st_intersects,
  left = TRUE
)

## A handful of sewershed multipolygons have internal self-overlap that survived every geometry
## sanitization above (st_make_valid, st_union, st_buffer(0)) — the geometries are gnarly enough
## that st_intersects() returns the same epaid twice for a single hospital point. The duplicate
## rows carry identical sewershed assignment, so collapsing post-join loses no information.
lu_pfi_complete <- lu_pfi_complete |>
  distinct(v_pfi, .keep_all = TRUE)

##--------------------------------------------------------------------------------------------------
## 6. Stash into the dicts list and clean up
##--------------------------------------------------------------------------------------------------
temp_dicts <- list(lu_sewer = temp_sewer, lu_facilities = temp_pfi, lu_pfi = lu_pfi_complete)
dicts[names(temp_dicts)] <- NULL
dicts <- c(dicts, temp_dicts)

## drop temp tables
rm(temp_dicts, temp_pfi, temp_sewer, temp_shed, lu_pfi_complete)

3.2. Recode WWTP Data

Show code
ww <- lu_run$a2_data[[which(lu_run$run_id == "waste")]]
lu_sewer <- dicts$lu_sewer

ww <- ww |>
  mutate(
    flag_belowlod_derived = (v_conc == 0) | (v_conc < v_lodsewage),
    vp_conc               = if_else(flag_belowlod_derived, v_lodsewage / 2, v_conc),
    vp_conc_log           = log1p(vp_conc),
  ) |>
  select(v_epaid, v_date, vp_conc, vp_conc_log, flag_belowlod_derived,
         v_totalpop, v_wwtp)


sg <- nrow(ww)
ww <- left_join(ww, lu_sewer, by = c("v_epaid" = "epaid"))
if (!(nrow(ww) == sg)) {
  stop("join duping")
}
n_unmatched <- sum(is.na(ww$WKT))
if (n_unmatched > 0) {
  warning(sprintf("%d WW rows had no sewershed match", n_unmatched))
}

map_ww <- ww |>
  st_drop_geometry() |>
  count(v_epaid, WKT) |>
  st_as_sf(wkt = "WKT", crs = 4326)

lu_run$a3_data[[which(lu_run$run_id == "waste")]] <- ww

rm(ww, lu_sewer)

3.2.2 Map of Log-Transformed Sample Count by Sewesehd

Show code
map_ny_counties = counties(state = "NY", cb = TRUE, progress_bar = FALSE)

ggplot() +
  geom_sf(data = map_ny_counties, fill = "white", color = "grey") +
  geom_sf(data = map_ww, aes(fill = n), linewidth = 0.2,, alpha = 0.8) +
  scale_fill_viridis_c(trans = "log10") +
  theme_classic() +
  labs(title = "Sample count (log-transformed) by sewershed") + 
       theme(axis.text = element_blank(), axis.ticks = element_blank(), panel.grid = element_blank())

3.3. Recode Hospital Data

Show code
## Grab Hospital Location from Dicts

lu_pfi <- dicts$lu_pfi
lu_sewer <- dicts$lu_sewer


hosp <- lu_run$a2_data[[which(lu_run$run_id == "hosp")]]
hosp <-  left_join(hosp, lu_pfi)

qa_excluded_hospitals <- hosp |>
  filter(is.na(epaid)) 


map_excluded_sf <- hosp |>
  filter(is.na(epaid)) |>
  ## Also filtering out surge centers
  filter(!str_starts(v_pfi, "080")) |>
  distinct(v_pfi, v_facility, geometry) |>
  st_as_sf()

hosp <- hosp |>
  filter(!is.na(epaid)) |>
  filter(!str_starts(v_pfi, "080"))

lu_run$a3_data[[which(lu_run$run_id == "hosp")]] <- hosp 


rm(hosp, lu_pfi)

3.3.1 Table of all Hospitals Excluded from Analysis

Show code
### List of Excluded Hospitals
qa_excluded_hospitals <- qa_excluded_hospitals |> 
  group_by(v_pfi, v_facility) |>
  summarize(
    n = n(),
    v_currentpatients = sum(v_currentpatients),
    v_patients_icu = sum(v_patients_icu),
    v_admissions_totalnew = sum(v_admissions_totalnew),
    min = min(v_date),
    max = max(v_date))

qa_excluded_hospitals|>
  kable()
v_pfi v_facility n v_currentpatients v_patients_icu v_admissions_totalnew min max
000303 THE UNIVERSITY OF VERMONT HEALTH NETWORK - ELIZABETHTOWN COMMUNITY HOSPITAL 1630 1524 0 241 2020-03-26 2025-10-06
000541 NORTH SHORE UNIVERSITY HOSPITAL 1630 95218 16697 8799 2020-03-26 2025-10-06
000817 CLIFTON-FINE HOSPITAL 1630 241 0 45 2020-03-26 2025-10-06
000873 IRA DAVENPORT MEMORIAL HOSPITAL 1630 576 0 152 2020-03-26 2025-10-06
000885 LONG ISLAND COMMUNITY HOSPITAL 1630 25049 5730 3653 2020-03-26 2025-10-06
000889 UNIVERSITY HOSPITAL - STONY BROOK SOUTHAMPTON HOSPITAL 1630 9418 1976 1548 2020-03-26 2025-10-06
000896 ST. CHARLES HOSPITAL 1630 13159 3215 2306 2020-03-26 2025-10-06
000943 ST CATHERINE OF SIENA HOSPITAL 1630 25153 4122 4001 2020-03-26 2025-10-06
000968 CATSKILL REGIONAL MEDICAL CENTER - G. HERMANN SITE 1630 139 0 31 2020-03-26 2025-10-06
001002 ELLENVILLE REGIONAL HOSPITAL 1630 2742 0 336 2020-03-26 2025-10-06
003067 MILLARD FILLMORE SUBURBAN HOSPITAL 1630 32373 4931 5462 2020-03-26 2025-10-06
080001 JAVITS CENTER HOSPITAL 1624 4877 12 0 2020-04-01 2025-10-06
080002 USNS COMFORT HOSPITAL 25 516 0 0 2020-04-03 2020-04-27
080003 MOUNT SINAI - SAMARITANS PURSE 1609 499 12 0 2020-04-16 2025-10-06
080004 NYC H+H BILLIE JEAN KING TENNIS CENTER 1609 446 0 0 2020-04-16 2025-10-06
080005 NYC H+H ROOSEVELT ISLAND MEDICAL CENTER 1608 2502 0 0 2020-04-17 2025-10-06
080006 MAIMONIDES CROWN HEIGHTS 1608 418 0 0 2020-04-17 2025-10-06
080007 NYP - RYAN LARKIN FIELD HOSPITAL 1604 451 0 31 2020-04-21 2025-10-06

3.3.2 Map of Hospitals Outside Sewersheds

Show code
map_lu_sewer <- dicts$lu_sewer
ggplot() +
  geom_sf(data = map_ny_counties, fill = "white", color = "grey")+
  geom_sf(data = map_lu_sewer, fill = "lightblue", color = "white", linewidth = 0.2, alpha = 0.8) +
  geom_sf(data = map_excluded_sf, color = "red", size = 2, alpha = 0.7) +
  theme_classic() +
  labs(title = "Hospitals excluded from sewershed analysis",
       subtitle = paste0(nrow(map_excluded_sf), " facilities outside any monitored sewershed"))+
  theme(axis.text = element_blank(), axis.ticks = element_blank(), panel.grid = element_blank())

Show code
rm(lu_sewer)

Part 4. Aggregating Weekly and Merging Inputs

Show code
week_start_dow <- 1  # Monday start
window_model_start <- "2023-01-01"
window_model_end <- "2024-12-31"

4.1. Assessing Data Distribution Prior to Aggregation

4.1.1. Visual Inspection of Distribution of Data Across Weekdays

Show code
ww <- lu_run$a3_data[[which(lu_run$run_id == "waste")]]

qa_ww_cadence <- ww %>%   # swap in whatever your post-filter WW df is called
  filter(!is.na(vp_conc)) %>%
  mutate(
    weekday = wday(v_date, label = TRUE, week_start = 1),  # Mon-start, ordered
    year    = year(v_date)
  )

ggplot(qa_ww_cadence, aes(x = weekday)) +
  geom_bar() +
  facet_wrap(~ year, ncol = 2) +
  labs(
    title    = "WW sampling cadence by weekday",
    subtitle = "Populated v_conc records, faceted by calendar year",
    x = NULL, y = "Records"
  ) +
  theme_minimal()

Show code
rm(ww)

4.1.2. Visual Inspection of Distribution of Data Across Weeks

Show code
qa_ww_cadence %>%
  mutate(iso_week = floor_date(v_date, "week", week_start = 1)) %>%
  count(iso_week) %>%
  ggplot(aes(iso_week, n)) +
  geom_col() +
  labs(title = "Total WW samples per ISO week",
       x = NULL, y = "Samples") +
  theme_minimal()

Show code
qa_ww_cadence %>%
  mutate(iso_week = floor_date(v_date, "week", week_start = 1)) %>%
  count(v_epaid, iso_week) %>%
  group_by(v_epaid) %>%
  mutate(total_n = sum(n)) %>%
  ungroup() %>%
  mutate(v_epaid = forcats::fct_reorder(v_epaid, total_n)) %>%
  ggplot(aes(iso_week, v_epaid, fill = n)) +
  geom_tile() +
  scale_fill_viridis_c() +
  labs(title = "WW sampling coverage by sewershed × week",
       subtitle = "Sorted by total sample count",
       x = NULL, y = "Sewershed") +
  theme_minimal() +
  theme(axis.text.y = element_blank())

4.2 Aggregating Weekly-Level Hospital Data

Show code
hosp <- lu_run$a3_data[[which(lu_run$run_id == "hosp")]]

# Two-step because admissions is a flow but census/ICU are stocks
# week_start set in params at beginning.
hosp_weekly <- hosp |>
  mutate(week = floor_date(v_date, "week", week_start = week_start_dow)
         ) |>
  # Step 1: per facility per week — sum admissions (flow), mean census/ICU (stock)
  group_by(sw_id, v_pfi, week) |>
  summarize(
    admissions_fac = sum(v_admissions_totalnew, na.rm = TRUE),
    census_fac     = mean(v_currentpatients,    na.rm = TRUE),
    icu_fac        = mean(v_patients_icu,       na.rm = TRUE),
    .groups = "drop"
  ) |>
  # Step 2: roll facilities up to sewershed-week — sum across facilities
  group_by(sw_id, week) |>
  summarize(
    v_admissions_totalnew = sum(admissions_fac, na.rm = TRUE),
    v_currentpatients     = sum(census_fac,     na.rm = TRUE),
    v_patients_icu        = sum(icu_fac,        na.rm = TRUE),
    .groups = "drop"
  )

rm(hosp)

head(hosp_weekly) |>
  kable()
sw_id week v_admissions_totalnew v_currentpatients v_patients_icu
36001NY0026867AL11 2020-03-23 0 26.25000 11.00000
36001NY0026867AL11 2020-03-30 0 71.00000 29.00000
36001NY0026867AL11 2020-04-06 0 115.14286 55.28571
36001NY0026867AL11 2020-04-13 0 93.42857 40.57143
36001NY0026867AL11 2020-04-20 0 85.71429 30.00000
36001NY0026867AL11 2020-04-27 8 84.57143 31.00000

4.3 Aggregating Weekly-Level WW Data

Show code
waste <- lu_run$a3_data[[which(lu_run$run_id == "waste")]]

# ---- WW weekly aggregation ----
waste_weekly <- waste |>
  mutate(week = floor_date(v_date, "week", week_start = week_start_dow)) |>
  group_by(sw_id, week) |>
  summarize(
    vp_conc_log_mean = mean(vp_conc_log, na.rm = TRUE),
    n_samples        = n(),
    frac_detects     = mean(!flag_belowlod_derived, na.rm = TRUE),
    .groups = "drop"
  )

rm(waste)

head(waste_weekly) |>
  kable()
sw_id week vp_conc_log_mean n_samples frac_detects
36001NY0022225ACWWF 2023-02-06 11.080618 1 1
36001NY0022225ACWWF 2023-02-13 5.993961 1 0
36001NY0022225ACWWF 2023-02-20 11.522886 1 1
36001NY0022225ACWWF 2023-02-27 8.888895 1 1
36001NY0022225ACWWF 2023-03-06 10.555839 1 1
36001NY0022225ACWWF 2023-03-13 5.993961 1 0

4.4 Generate panel skeleton from sewershed lookup

Show code
lu_sewer <- dicts$lu_sewer

all_swsheds <- unique(lu_sewer$sw_id)  # adjust to your actual ID col
panel_window_start <- as.Date("2020-09-01")
panel_window_end   <- as.Date("2024-12-31")

all_weeks <- seq.Date(
  from = floor_date(panel_window_start, "week", week_start = week_start_dow),
  to   = floor_date(panel_window_end,   "week", week_start = week_start_dow),
  by   = "week"
)
panel_skeleton <- expand_grid(
  sw_id = all_swsheds,
  week = all_weeks
)
rm(lu_sewer)


head(panel_skeleton)
# A tibble: 6 × 2
  sw_id               week      
  <chr>               <date>    
1 36011NY0021903CCWW1 2020-08-31
2 36011NY0021903CCWW1 2020-09-07
3 36011NY0021903CCWW1 2020-09-14
4 36011NY0021903CCWW1 2020-09-21
5 36011NY0021903CCWW1 2020-09-28
6 36011NY0021903CCWW1 2020-10-05

4.5 Join weekly hospital and wwtp data to panel - Mask Missing vals

Show code
# ---- Join ----
panel <- panel_skeleton |>
  left_join(waste_weekly,   by = c("sw_id", "week")) |>
  left_join(hosp_weekly, by = c("sw_id", "week"))

panel <- panel |>
  mutate(ww_missing_mask = as.integer(is.na(vp_conc_log_mean)),
         hosp_missing_mask = as.integer(is.na(v_admissions_totalnew)),
)
 
rm(panel_skeleton, waste_weekly, hosp_weekly)

head(panel)
# A tibble: 6 × 10
  sw_id week       vp_conc_log_mean n_samples frac_detects v_admissions_totalnew
  <chr> <date>                <dbl>     <int>        <dbl>                 <dbl>
1 3601… 2020-08-31            NA           NA           NA                     0
2 3601… 2020-09-07            NA           NA           NA                     0
3 3601… 2020-09-14            NA           NA           NA                     1
4 3601… 2020-09-21            NA           NA           NA                     0
5 3601… 2020-09-28            NA           NA           NA                     3
6 3601… 2020-10-05             5.99         1            0                     1
# ℹ 4 more variables: v_currentpatients <dbl>, v_patients_icu <dbl>,
#   ww_missing_mask <int>, hosp_missing_mask <int>

4.6 Forward Fill

Show code
panel_filled <- panel |>
  arrange(sw_id, week) |>
  group_by(sw_id) |>
  fill(v_admissions_totalnew, v_currentpatients, v_patients_icu, vp_conc_log_mean,
       .direction = "down") |>
  ungroup()

4.7. Diagnostics and Visual

4.7.1. Full Period Coverage by Sewershed Summaries

Show code
# Per-sewershed coverage summary
qa_coverage_summary <- panel_filled |>
  group_by(sw_id) |>
  summarize(
    n_weeks            = n(),
    hosp_coverage_pct  = mean(hosp_missing_mask == 0) * 100,
    ww_coverage_pct    = mean(ww_missing_mask == 0)   * 100,
    has_any_ww         = any(ww_missing_mask == 0),
    .groups = "drop"
  ) |>
  arrange(desc(ww_coverage_pct))

thresholds <- c(0.5, 0.6, 0.7, 0.8, 0.9)
qa_threshold_table <- tibble(
  threshold = thresholds,
  hosp_only      = map_int(thresholds, ~ sum(qa_coverage_summary$hosp_coverage_pct >= .x * 100)),
  ww_only        = map_int(thresholds, ~ sum(qa_coverage_summary$ww_coverage_pct   >= .x * 100)),
  hosp_and_ww    = map_int(thresholds, ~ sum(qa_coverage_summary$hosp_coverage_pct >= .x * 100 &
                                              qa_coverage_summary$ww_coverage_pct  >= .x * 100))
)
qa_threshold_table |> kable()
threshold hosp_only ww_only hosp_and_ww
0.5 108 90 54
0.6 108 52 32
0.7 108 18 9
0.8 108 10 5
0.9 108 8 5
Show code
ggplot(qa_coverage_summary, aes(hosp_coverage_pct, ww_coverage_pct)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 70, linetype = "dashed") +
  geom_vline(xintercept = 70, linetype = "dashed") +
  labs(x = "Hosp coverage %", y = "WW coverage %",
       title = "Per-sewershed coverage (each point = one sewershed), full analytical window")

4.7.2. Sewershed coerage by late/early window

Show code
windows <- tribble(
  ~window_label,         ~start,         ~end,
  "Early (2020-09 to 2021-09)", as.Date("2020-09-01"), as.Date("2021-09-01"),
  "Late (2023-01 to 2024-12)",  as.Date("2023-01-01"), as.Date("2024-12-31")
)

qa_coverage_by_window <- windows |>
  pmap_dfr(function(window_label, start, end) {
    panel_filled |>
      filter(week >= start, week <= end) |>
      group_by(sw_id) |>
      summarize(
        hosp_coverage_pct = mean(hosp_missing_mask == 0) * 100,
        ww_coverage_pct   = mean(ww_missing_mask == 0)   * 100,
        .groups = "drop"
      ) |>
      mutate(window_label = window_label)
  })

qa_threshold_table_window <- qa_coverage_by_window |>
  group_by(window_label) |>
  group_modify(~ tibble(
    threshold   = thresholds,
    hosp_only   = map_int(thresholds, \(t) sum(.x$hosp_coverage_pct >= t * 100)),
    ww_only     = map_int(thresholds, \(t) sum(.x$ww_coverage_pct   >= t * 100)),
    hosp_and_ww = map_int(thresholds, \(t) sum(.x$hosp_coverage_pct >= t * 100 &
                                               .x$ww_coverage_pct  >= t * 100))
  )) |>
  ungroup()

qa_threshold_table_window |> kable()
window_label threshold hosp_only ww_only hosp_and_ww
Early (2020-09 to 2021-09) 0.5 108 8 5
Early (2020-09 to 2021-09) 0.6 108 8 5
Early (2020-09 to 2021-09) 0.7 108 7 5
Early (2020-09 to 2021-09) 0.8 108 6 4
Early (2020-09 to 2021-09) 0.9 108 5 3
Late (2023-01 to 2024-12) 0.5 108 138 63
Late (2023-01 to 2024-12) 0.6 108 137 63
Late (2023-01 to 2024-12) 0.7 108 131 62
Late (2023-01 to 2024-12) 0.8 108 113 58
Late (2023-01 to 2024-12) 0.9 108 78 47
Show code
ggplot(qa_coverage_by_window, aes(hosp_coverage_pct, ww_coverage_pct)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 70, linetype = "dashed") +
  geom_vline(xintercept = 70, linetype = "dashed") +
  facet_wrap(~ window_label) +
  labs(x = "Hosp coverage %", y = "WW coverage %",
       title = "Per-sewershed coverage by analysis window")

Based on this plot, Model 1 and 2 will use all eligible sewersheds - filtering to hospitals which have at least 70% coverage within this window.

Model 3.1 and 3.2, which utilize the late window, will filter to sewersheds with both at least 70% WW coverage and at least 70% hospital coverage. M3.1 will be evaluated twice - once with the full list of all eligible sewersheds (at least 70% coverage of hospitals, no WW threshold), and once for apples to apples comparison with M3.2.

4.8. Export Filled Panel and Hospital + Hosp/WW Eligibility Sets

Confirmed that the 108 sewersheds present in the early dataset are the same present in the late dataset, so M1-M3.1.1 will be run on the same “hosp_only” eligibility set, then m3.1.2 and M2.2 will be run on the “hosp_and_ww”.

4.8.1 - Eligibility Lists Written as CSVs

Show code
### Set this threshold - can change but based on the abvoe data, 70% is the sweet spot.
threshold <- 0.7


### Generate hospital only list - set to be Early Window but it's the same in both windows, confirmed
hosp_only <-  qa_coverage_by_window |>
  filter(window_label == "Early (2020-09 to 2021-09)" & 
           hosp_coverage_pct >= threshold
           ) |>
  pull(sw_id)

## Write the CSV
write.csv(hosp_only, file = file.path("data/eligibility/hosp_only.csv"), col.names = FALSE, row.names = FALSE)


## generate late period 70% coverage eligibility lists
hosp_and_ww <-  qa_coverage_by_window |>
  filter(window_label == "Late (2023-01 to 2024-12)" & 
           hosp_coverage_pct >= threshold &
           ww_coverage_pct >= threshold
           ) |>
  pull(sw_id)


## write the CSV
write.csv(hosp_and_ww, file = file.path("data/eligibility/hosp_and_ww.csv"), col.names = FALSE, row.names = FALSE)

4.8.1 - Write Filled Panel as CSVs

Show code
## Write the CSV
write.csv(panel_filled, file = file.path("data/panel/panel_filled.csv"), row.names=FALSE)

Part 5. Transition to Python

5.1. Python Environment Setup

Using Reticulate to work in a Conda environment here that I named “covid-lstm”.

Import torch, pandas, numpy

Show code
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
print(torch.__version__)
2.10.0
Show code
print(pd.__version__)
3.0.2
Show code
print("CUDA available:" , torch.cuda.is_available())
CUDA available: True
Show code
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

5.2. Import Data for Modeling

5.2.1. Model Lookup

Show code
##Kable equivalent - passes into HTML
def scrollable(df, height=350):
    from IPython.display import HTML
    style = f"max-height: {height}px; overflow-y: auto; border: 1px solid #ddd;"
    return HTML(f'<div style="{style}">{df.to_html(index=False)}</div>')

lu_models = pd.read_csv("dictionaries/508_Final Project_models_AMossawir.csv",
                       parse_dates=["window_start", "window_end"])

scrollable(lu_models)
model_id stage description window_start window_end features eligibility_set seq_len hidden_size num_layers learning_rate batch_size epochs embedding_dim seed
M1 baseline Hosp-only LSTM, early window, default hparams 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 12 32 1 0.0010 64 30 8 508
M2_01 sweep Sweep: seq_len=4, hidden=16, lr=0.01 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 4 16 1 0.0100 64 30 8 508
M2_02 sweep Sweep: seq_len=4, hidden=16, lr=0.001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 4 16 1 0.0010 64 30 8 508
M2_03 sweep Sweep: seq_len=4, hidden=16, lr=0.0001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 4 16 1 0.0001 64 30 8 508
M2_04 sweep Sweep: seq_len=4, hidden=32, lr=0.01 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 4 32 1 0.0100 64 30 8 508
M2_05 sweep Sweep: seq_len=4, hidden=32, lr=0.001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 4 32 1 0.0010 64 30 8 508
M2_06 sweep Sweep: seq_len=4, hidden=32, lr=0.0001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 4 32 1 0.0001 64 30 8 508
M2_07 sweep Sweep: seq_len=4, hidden=64, lr=0.01 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 4 64 1 0.0100 64 30 8 508
M2_08 sweep Sweep: seq_len=4, hidden=64, lr=0.001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 4 64 1 0.0010 64 30 8 508
M2_09 sweep Sweep: seq_len=4, hidden=64, lr=0.0001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 4 64 1 0.0001 64 30 8 508
M2_10 sweep Sweep: seq_len=8, hidden=16, lr=0.01 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 8 16 1 0.0100 64 30 8 508
M2_11 sweep Sweep: seq_len=8, hidden=16, lr=0.001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 8 16 1 0.0010 64 30 8 508
M2_12 sweep Sweep: seq_len=8, hidden=16, lr=0.0001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 8 16 1 0.0001 64 30 8 508
M2_13 sweep Sweep: seq_len=8, hidden=32, lr=0.01 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 8 32 1 0.0100 64 30 8 508
M2_14 sweep Sweep: seq_len=8, hidden=32, lr=0.001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 8 32 1 0.0010 64 30 8 508
M2_15 sweep Sweep: seq_len=8, hidden=32, lr=0.0001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 8 32 1 0.0001 64 30 8 508
M2_16 sweep Sweep: seq_len=8, hidden=64, lr=0.01 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 8 64 1 0.0100 64 30 8 508
M2_17 sweep Sweep: seq_len=8, hidden=64, lr=0.001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 8 64 1 0.0010 64 30 8 508
M2_18 sweep Sweep: seq_len=8, hidden=64, lr=0.0001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 8 64 1 0.0001 64 30 8 508
M2_19 sweep Sweep: seq_len=12, hidden=16, lr=0.01 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 12 16 1 0.0100 64 30 8 508
M2_20 sweep Sweep: seq_len=12, hidden=16, lr=0.001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 12 16 1 0.0010 64 30 8 508
M2_21 sweep Sweep: seq_len=12, hidden=16, lr=0.0001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 12 16 1 0.0001 64 30 8 508
M2_22 sweep Sweep: seq_len=12, hidden=32, lr=0.01 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 12 32 1 0.0100 64 30 8 508
M2_23 sweep Sweep: seq_len=12, hidden=32, lr=0.001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 12 32 1 0.0010 64 30 8 508
M2_24 sweep Sweep: seq_len=12, hidden=32, lr=0.0001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 12 32 1 0.0001 64 30 8 508
M2_25 sweep Sweep: seq_len=12, hidden=64, lr=0.01 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 12 64 1 0.0100 64 30 8 508
M2_26 sweep Sweep: seq_len=12, hidden=64, lr=0.001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 12 64 1 0.0010 64 30 8 508
M2_27 sweep Sweep: seq_len=12, hidden=64, lr=0.0001 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 12 64 1 0.0001 64 30 8 508
M2 tuned Hosp-only LSTM, early window, sweep winner (TRANSCRIBE FROM SWEEP) 2020-09-01 2021-09-01 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 8 32 1 0.0100 64 30 8 508
M3_1_1 transfer Hosp-only LSTM, late window, hospital-full-subsets TBD 2023-01-01 2024-12-31 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_only 8 32 1 0.0100 64 30 8 508
M3_1_2 transfer Hosp-only LSTM, late window, M2 hparams transferred (TRANSCRIBE FROM M2) 2023-01-01 2024-12-31 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask hosp_and_ww 8 32 1 0.0100 64 30 8 508
M3_2 treatment Hosp+WW LSTM, late window, M2 hparams transferred (TRANSCRIBE FROM M2) 2023-01-01 2024-12-31 v_admissions_totalnew,v_currentpatients,v_patients_icu,hosp_missing_mask,vp_conc_log_mean,ww_missing_mask hosp_and_ww 8 32 1 0.0100 64 30 8 508

5.2.2. Eligibility Lists

Show code
## Read both from CSSV
hosp_only = pd.read_csv("data/eligibility/hosp_only.csv")["x"].tolist()
hosp_and_ww = pd.read_csv("data/eligibility/hosp_and_ww.csv")["x"].tolist()

eligibility_lookup = {
    "hosp_only": hosp_only,
    "hosp_and_ww": hosp_and_ww,
}
print("Hospital Only Subset: ", len(hosp_only))
Hospital Only Subset:  108
Show code
print("Hospital and WW Subset: ", len(hosp_and_ww))
Hospital and WW Subset:  69

5.2.3. Panel

Show code
panel = pd.read_csv("data/panel/panel_filled.csv", parse_dates=["week"])

print("Panel Dimensions: ", panel.shape)
Panel Dimensions:  (135292, 10)

Part 6. Modeling Utilities

6.1. Tensor Construction (panel_to_tensors)

Show code
# Creating Time-Series Sequences
# Neural networks require fixed-size input samples.

# To convert the continuous time series into training samples, we use a sliding window approach.

# For each sample:
    #Input:
    #A window containing the past **72 time steps**.

    #Target:
    # The temperature value **6 time steps in the future**.

# This converts the dataset into many input–output pairs suitable for training neural networks.

## This takes panel, single row of lookup, eligibility_lookup
## Horizon = 2 because we're predicting t+2
## Test fract = 0.2, this needs to be handled here because we are dividing this up by date-range for prediction.

def panel_to_tensors(panel_df, config_row, eligibility_lookup, horizon=2, test_frac=0.2):
    feature_cols = config_row["features"].split(",")     ## Grab list of inputs, turn into list
    seq_len = int(config_row["seq_len"])      ## Sequence length is lookback window, analogous to window size
    window_start = pd.to_datetime(config_row["window_start"]) ## Filter model date window
    window_end = pd.to_datetime(config_row["window_end"])    ## filter model date windo
    eligible_sw_ids = eligibility_lookup[config_row["eligibility_set"]]   ## filter this to only the relevant SWs
# 2. Filter panel to eligible sewersheds and analysis window
    p = panel_df[
        panel_df["sw_id"].isin(eligible_sw_ids) &
        (panel_df["week"] >= window_start) &
        (panel_df["week"] <= window_end)
    ].copy()
    p = p.sort_values(["sw_id", "week"])
    
    # 3. Backfill leading NAs in vp_conc_log_mean per sewershed (mask flags imputed cells)
    if "vp_conc_log_mean" in feature_cols:
        p["vp_conc_log_mean"] = (
            p.groupby("sw_id", group_keys=False)["vp_conc_log_mean"].bfill()
        )
    
    # 4. Build sw_id → integer index mapping for embedding layer
    sw_id_to_int = {sw: i for i, sw in enumerate(sorted(p["sw_id"].unique()))}
    
    # 5. Determine chronological train/test cutoff (same date for all sewersheds)
    all_weeks = sorted(p["week"].unique())
    cutoff_idx = int(len(all_weeks) * (1 - test_frac))
    cutoff_date = all_weeks[cutoff_idx]
    
    # 6. Per-sewershed sliding window: build sample triples - This is the same as create_sequence but has sw embed
    X_train, y_train, idx_train = [], [], []
    X_test, y_test, idx_test = [], [], []
    
    ## Looping over each sewershed - SW-loop features
    for sw_id, group in p.groupby("sw_id"):
        group = group.sort_values("week").reset_index(drop=True)
        features = group[feature_cols].values.astype(np.float32)
        ## target always v_admissions
        target = group["v_admissions_totalnew"].values.astype(np.float32)
        weeks = group["week"].values
        
        # Slide window: input = features[i : i+seq_len], target = log(target[i+seq_len+horizon-1])
        for i in range(len(group) - seq_len - horizon + 1):
            X_window = features[i : i + seq_len]
            ## this is because python >:|, -1 to get this to ex. 6 + 2 week window, 9 - 1 because 0 = 1
            y_value = target[i + seq_len + horizon - 1]
            ## this is because python >:|
            target_week = weeks[i + seq_len + horizon - 1]
            
            # Skip if target is NaN (shouldn't happen for hosp, can happen if WW row had no fill possible)
            if np.isnan(y_value) or np.isnan(X_window).any():
                continue
            
            # Log-transform target - count data, heavily skewed
            ## I am not doing poisson even though it's a better match because lab doesn't
            y_log = np.log1p(y_value)
            ## grab sewershed specificity for embedding
            sw_idx = sw_id_to_int[sw_id]
            
            # Assign to train or test by target week
            if target_week < cutoff_date:
                X_train.append(X_window)
                y_train.append(y_log)
                idx_train.append(sw_idx)
            else:
                X_test.append(X_window)
                y_test.append(y_log)
                idx_test.append(sw_idx)
    
    # 7. Stack into torch tensors
    def stack(X_list, y_list, idx_list):
        return (
            torch.tensor(np.array(X_list), dtype=torch.float32),
            torch.tensor(np.array(y_list), dtype=torch.float32),
            torch.tensor(np.array(idx_list), dtype=torch.long),
        )
    
    tensors = {
        "train": stack(X_train, y_train, idx_train),
        "test": stack(X_test, y_test, idx_test),
    }
    
    return tensors, sw_id_to_int

tensors, sw_id_to_int = panel_to_tensors(panel, lu_models.iloc[0], eligibility_lookup)
print("Train X shape:", tensors["train"][0].shape)
Train X shape: torch.Size([3024, 12, 4])
Show code
print("Test X shape:", tensors["test"][0].shape)
Test X shape: torch.Size([1188, 12, 4])
Show code
print("N sewersheds:", len(sw_id_to_int))
N sewersheds: 108

Train + Test = 3456 + 1188 = 4644 108 x (52-8 - 2 + 1(ew))

6.2. Sequence Dataset

Show code
## This is the same as what we use in seq pred lab but include added sw embed
## doesn't need tensor thing because did it before

from torch.utils.data import Dataset, DataLoader

class SequenceDataset(Dataset):
    def __init__(self, X, y, sw_idx):
        self.X = X
        self.y = y
        self.sw_idx = sw_idx

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx], self.sw_idx[idx]
      
X_train, y_train, idx_train = tensors["train"]
ds = SequenceDataset(X_train, y_train, idx_train)
print("len:", len(ds))
len: 3024
Show code
print("first sample shapes:", [t.shape for t in ds[0]])      
first sample shapes: [torch.Size([12, 4]), torch.Size([]), torch.Size([])]

6.3. Model architecture (LSTMRegressor)

Show code
# LSTM Model Architecture: https://docs.pytorch.org/docs/stable/generated/torch.nn.LSTM.html
class LSTMRegressor(nn.Module):
    def __init__(self, input_size, hidden_size, n_sewersheds, num_layers, embedding_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_size=input_size,
                            hidden_size=hidden_size,
                            num_layers=num_layers,
                            batch_first=True)
        # add this to embed sewershed
        self.embedding = nn.Embedding(n_sewersheds, embedding_dim)
        # add embedding dimension
        self.fc = nn.Linear(hidden_size + embedding_dim, 1)

    def forward(self, x, sw_idx):
        out, (h, c) = self.lstm(x)
        last_hidden = out[:, -1, :]
        emb = self.embedding(sw_idx)
        concat = torch.cat([last_hidden, emb], dim=1)
        return self.fc(concat)
      
      
      

config = lu_models.iloc[0]

n_features    = len(config["features"].split(","))
hidden_size   = int(config["hidden_size"])
num_layers    = int(config["num_layers"])
embedding_dim = int(config["embedding_dim"])
n_sewersheds  = len(sw_id_to_int)

model = LSTMRegressor(input_size=n_features, hidden_size=hidden_size,  n_sewersheds=n_sewersheds, embedding_dim=embedding_dim, num_layers=num_layers)

fake_X = torch.zeros(2, int(config["seq_len"]), n_features)
fake_idx = torch.tensor([0, 1], dtype=torch.long)
out = model(fake_X, fake_idx)
print("Output shape:", out.shape)
Output shape: torch.Size([2, 1])
Show code
print("Output values:", out)
Output values: tensor([[-0.4940],
        [-0.5252]], grad_fn=<AddmmBackward0>)

6.4. Training and evaluation helpers

6.4.1. Train One Epoch

Lab function + sw_idx

Show code

def train_one_epoch(model, loader, criterion, optimizer):
    model.train()
    total_loss = 0.0
    ## Added sw_batch
    for X_batch, y_batch, sw_batch in loader:
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)
        ## Added sw_batch
        sw_batch = sw_batch.to(device)

        optimizer.zero_grad()
        ## Added sw_batch - with squeeze because got shape error
        preds = model(X_batch, sw_batch).squeeze(-1)
        loss = criterion(preds, y_batch)
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * X_batch.size(0)

    return total_loss / len(loader.dataset)

6.4.2. Evaluate Loss

Lab function + sw_idx

Show code

@torch.no_grad()
def evaluate_loss(model, loader, criterion):
    model.eval()
    total_loss = 0.0

    for X_batch, y_batch, sw_batch in loader:
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)
        sw_batch = sw_batch.to(device)

        preds = model(X_batch, sw_batch).squeeze(-1)
        loss = criterion(preds, y_batch)

        total_loss += loss.item() * X_batch.size(0)

    return total_loss / len(loader.dataset)

6.4.3. Collect predictions

Lab function + sw_idx

Show code

@torch.no_grad()
def collect_predictions(model, loader):
    model.eval()
    preds_all = []
    targets_all = []
    sw_all = []

    for X_batch, y_batch, sw_batch in loader:
        X_batch = X_batch.to(device)
        sw_batch = sw_batch.to(device)
        preds = model(X_batch, sw_batch).squeeze(-1).cpu().numpy()
        targets = y_batch.numpy()
        preds_all.append(preds)
        targets_all.append(targets)
        sw_all.append(sw_batch.cpu().numpy())

    preds_all = np.concatenate(preds_all)
    targets_all = np.concatenate(targets_all)
    # back-transform from log scale to counts
    preds_all = np.expm1(preds_all)
    targets_all = np.expm1(targets_all)
    sw_all = np.concatenate(sw_all)

    return preds_all, targets_all, sw_all

6.4.4. Plot learning Cureve

Show code

def plot_learning_curve(history, model_name):
    epochs_range = range(1, len(history["train_loss"]) + 1)
    plt.figure(figsize=(7, 4))
    plt.plot(epochs_range, history["train_loss"], marker="o", label="Train Loss")
    plt.plot(epochs_range, history["test_loss"],  marker="s", label="Test Loss")
    plt.xlabel("Epoch")
    plt.ylabel("MSE Loss (log scale)")
    plt.title(f"{model_name}: Train vs Test Loss")
    plt.legend()
    plt.grid(True, linestyle="--", alpha=0.4)
    plt.show()

6.4.5. Train Model

Main adjustment from lab was swapping everything to be driven by the lookup row

Show code

def train_model(panel, config_row, eligibility_lookup, show_plot = True, verbose=True):

    # 1. Pull hparams from config row
    model_id     = config_row["model_id"]
    hidden_size  = int(config_row["hidden_size"])
    num_layers   = int(config_row["num_layers"])
    embedding_dim = int(config_row["embedding_dim"])
    batch_size   = int(config_row["batch_size"])
    epochs       = int(config_row["epochs"])
    lr           = float(config_row["learning_rate"])
    seed         = int(config_row["seed"])

    # 2. Reproducibility
    torch.manual_seed(seed)
    np.random.seed(seed)

    # 3. Build tensors via calling panel to tensors
    tensors, sw_id_to_int = panel_to_tensors(panel, config_row, eligibility_lookup)
    X_train, y_train, idx_train = tensors["train"]
    X_test,  y_test,  idx_test  = tensors["test"]
    n_features = X_train.shape[2]
    n_sewersheds = len(sw_id_to_int)
    
    # 4. Build datasets and loaders via sequence dataset
    train_ds = SequenceDataset(X_train, y_train, idx_train)
    test_ds  = SequenceDataset(X_test,  y_test,  idx_test)
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    test_loader  = DataLoader(test_ds,  batch_size=batch_size, shuffle=False)

     # 5. Build model via LSTM Regressor
    model = LSTMRegressor(
        input_size=n_features,
        hidden_size=hidden_size,
        n_sewersheds=n_sewersheds,
        num_layers=num_layers,
        embedding_dim=embedding_dim,
    ).to(device)
    
    # 6. Loss + optimizer
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    
    # 7. Training loop — uses Step D helpers
    history = {"train_loss": [], "test_loss": []}
    best_test_loss = float("inf")
    best_state = None
    
    for epoch in range(1, epochs + 1):
        train_loss = train_one_epoch(model, train_loader, criterion, optimizer)
        test_loss  = evaluate_loss(model, test_loader, criterion)
        history["train_loss"].append(train_loss)
        history["test_loss"].append(test_loss)
        if test_loss < best_test_loss:
            best_test_loss = test_loss
            best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
        if verbose:
            print(f"{model_id} | Epoch {epoch}/{epochs} | Train: {train_loss:.4f} | Test: {test_loss:.4f}")

    # 8. Restore best state, get predictions
    model.load_state_dict(best_state)
    preds, targets, sw_idx = collect_predictions(model, test_loader)
    
    # 9. Compute metrics on count scale
    mae  = float(np.mean(np.abs(preds - targets)))
    rmse = float(np.sqrt(np.mean((preds - targets) ** 2)))
    
    # 10. Return everything
    
    if show_plot:
        plot_learning_curve(history, model_id)
    return {
        "model_id":     model_id,
        "model":        model,
        "history":      history,
        "preds":        preds,
        "targets":      targets,
        "sw_idx":       sw_idx,
        "sw_id_to_int": sw_id_to_int,
        "mae":          mae,
        "rmse":         rmse,
    }

Part 7. Modeling

7.1. Model 1 - Baseline

Covid prediction off of hospital data only - early period where there is more data.

Show code
result_m1 = train_model(panel, lu_models.iloc[0], eligibility_lookup)
M1 | Epoch 1/30 | Train: 4.0112 | Test: 0.8657
M1 | Epoch 2/30 | Train: 1.1826 | Test: 0.5648
M1 | Epoch 3/30 | Train: 0.5845 | Test: 0.5041
M1 | Epoch 4/30 | Train: 0.4317 | Test: 0.4880
M1 | Epoch 5/30 | Train: 0.3595 | Test: 0.4756
M1 | Epoch 6/30 | Train: 0.3205 | Test: 0.4776
M1 | Epoch 7/30 | Train: 0.2986 | Test: 0.4594
M1 | Epoch 8/30 | Train: 0.2862 | Test: 0.4534
M1 | Epoch 9/30 | Train: 0.2767 | Test: 0.4667
M1 | Epoch 10/30 | Train: 0.2699 | Test: 0.4514
M1 | Epoch 11/30 | Train: 0.2661 | Test: 0.4473
M1 | Epoch 12/30 | Train: 0.2621 | Test: 0.4557
M1 | Epoch 13/30 | Train: 0.2611 | Test: 0.4556
M1 | Epoch 14/30 | Train: 0.2597 | Test: 0.4657
M1 | Epoch 15/30 | Train: 0.2563 | Test: 0.4524
M1 | Epoch 16/30 | Train: 0.2534 | Test: 0.4540
M1 | Epoch 17/30 | Train: 0.2504 | Test: 0.4810
M1 | Epoch 18/30 | Train: 0.2501 | Test: 0.4885
M1 | Epoch 19/30 | Train: 0.2484 | Test: 0.4824
M1 | Epoch 20/30 | Train: 0.2457 | Test: 0.4814
M1 | Epoch 21/30 | Train: 0.2434 | Test: 0.5054
M1 | Epoch 22/30 | Train: 0.2442 | Test: 0.5047
M1 | Epoch 23/30 | Train: 0.2426 | Test: 0.4774
M1 | Epoch 24/30 | Train: 0.2401 | Test: 0.4661
M1 | Epoch 25/30 | Train: 0.2390 | Test: 0.4968
M1 | Epoch 26/30 | Train: 0.2402 | Test: 0.4904
M1 | Epoch 27/30 | Train: 0.2370 | Test: 0.4722
M1 | Epoch 28/30 | Train: 0.2387 | Test: 0.4849
M1 | Epoch 29/30 | Train: 0.2344 | Test: 0.4853
M1 | Epoch 30/30 | Train: 0.2346 | Test: 0.5079

Show code
print("\n=== M1 final ===")

=== M1 final ===
Show code
print(f"MAE:  {result_m1['mae']:.3f} admissions")
MAE:  4.700 admissions
Show code
print(f"RMSE: {result_m1['rmse']:.3f} admissions")
RMSE: 9.945 admissions
Show code
print(f"Best test loss (log MSE): {min(result_m1['history']['test_loss']):.4f}")
Best test loss (log MSE): 0.4473
Show code
plot_learning_curve(result_m1["history"], "M1")

7.2 M2 — hyperparameter sweep

In order to find the best hyperparameters for this model, we will loop through the “sweep” section of the lookup.

Show code
sweep_indices = lu_models[lu_models["stage"] == "sweep"].index.tolist()

print(sweep_indices)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
Show code
sweep_results = []   # initialize
for idx in sweep_indices:
    config_row = lu_models.loc[idx]
    result = train_model(panel, config_row, eligibility_lookup, show_plot=False, verbose=False)
    sweep_results.append({
        "model_id":       result["model_id"],
        "seq_len":        int(config_row["seq_len"]),
        "hidden_size":    int(config_row["hidden_size"]),
        "learning_rate":  float(config_row["learning_rate"]),
        "mae":            result["mae"],
        "rmse":           result["rmse"],
        "best_test_loss": min(result["history"]["test_loss"]),
    })

sweep_df = pd.DataFrame(sweep_results)

## Sort by RMSE to get best fit
scrollable(sweep_df.sort_values("rmse").reset_index(drop=True))
model_id seq_len hidden_size learning_rate mae rmse best_test_loss
M2_07 4 64 0.0100 3.855698 7.634340 0.353938
M2_13 8 32 0.0100 4.171322 8.319045 0.388757
M2_04 4 32 0.0100 4.153130 8.767463 0.357772
M2_08 4 64 0.0010 4.318702 8.947997 0.388947
M2_18 8 64 0.0001 4.480975 9.049541 0.422520
M2_09 4 64 0.0001 4.428402 9.138783 0.408503
M2_01 4 16 0.0100 4.314425 9.192862 0.370301
M2_17 8 64 0.0010 4.515060 9.391414 0.404783
M2_11 8 16 0.0010 4.531479 9.445838 0.389287
M2_05 4 32 0.0010 4.422468 9.483552 0.393156
M2_25 12 64 0.0100 4.576123 9.505790 0.408910
M2_26 12 64 0.0010 4.615394 9.602592 0.423156
M2_27 12 64 0.0001 4.722316 9.676497 0.470032
M2_02 4 16 0.0010 4.474056 9.722293 0.390573
M2_14 8 32 0.0010 4.610190 9.852764 0.403625
M2_10 8 16 0.0100 4.404883 9.866540 0.387222
M2_06 4 32 0.0001 4.795813 9.906642 0.440181
M2_23 12 32 0.0010 4.700381 9.944742 0.447316
M2_20 12 16 0.0010 4.729808 10.052602 0.431354
M2_19 12 16 0.0100 4.748424 10.224441 0.405565
M2_15 8 32 0.0001 4.940846 10.311592 0.443992
M2_16 8 64 0.0100 4.549072 10.457841 0.380847
M2_22 12 32 0.0100 4.858771 10.999123 0.418311
M2_24 12 32 0.0001 5.300166 11.588200 0.484095
M2_03 4 16 0.0001 7.211026 16.825611 0.714947
M2_12 8 16 0.0001 7.483019 17.326490 0.790140
M2_21 12 16 0.0001 7.787136 17.897711 0.896242

Based on the results of the hyperparameter sweep, I chose M2_13 for the optimized run.

The items with a Seq Len of 12 performed consistently poorly - this makes sense given the limited data. - seq_len = 8 - hidden_size = 32 - learning_rate = 0.01

I have input these back into the lookup for Tuned M2 results, M3_1_1, M3_1_2, M3_2.

7.3. - Final Output of M2

Show code
config_m2 = lu_models[lu_models["model_id"] == "M2"].iloc[0]
result_m2 = train_model(panel, config_m2, eligibility_lookup)
M2 | Epoch 1/30 | Train: 1.0215 | Test: 0.4522
M2 | Epoch 2/30 | Train: 0.3378 | Test: 0.3991
M2 | Epoch 3/30 | Train: 0.3321 | Test: 0.4200
M2 | Epoch 4/30 | Train: 0.3272 | Test: 0.3888
M2 | Epoch 5/30 | Train: 0.3209 | Test: 0.4225
M2 | Epoch 6/30 | Train: 0.3089 | Test: 0.4147
M2 | Epoch 7/30 | Train: 0.3025 | Test: 0.4045
M2 | Epoch 8/30 | Train: 0.2997 | Test: 0.4169
M2 | Epoch 9/30 | Train: 0.2893 | Test: 0.4547
M2 | Epoch 10/30 | Train: 0.2846 | Test: 0.3928
M2 | Epoch 11/30 | Train: 0.2761 | Test: 0.4003
M2 | Epoch 12/30 | Train: 0.2747 | Test: 0.4311
M2 | Epoch 13/30 | Train: 0.2672 | Test: 0.3999
M2 | Epoch 14/30 | Train: 0.2645 | Test: 0.4476
M2 | Epoch 15/30 | Train: 0.2632 | Test: 0.4535
M2 | Epoch 16/30 | Train: 0.2597 | Test: 0.4288
M2 | Epoch 17/30 | Train: 0.2661 | Test: 0.3972
M2 | Epoch 18/30 | Train: 0.2613 | Test: 0.4126
M2 | Epoch 19/30 | Train: 0.2597 | Test: 0.4450
M2 | Epoch 20/30 | Train: 0.2518 | Test: 0.4537
M2 | Epoch 21/30 | Train: 0.2497 | Test: 0.4512
M2 | Epoch 22/30 | Train: 0.2451 | Test: 0.4356
M2 | Epoch 23/30 | Train: 0.2430 | Test: 0.4237
M2 | Epoch 24/30 | Train: 0.2421 | Test: 0.4238
M2 | Epoch 25/30 | Train: 0.2359 | Test: 0.4347
M2 | Epoch 26/30 | Train: 0.2317 | Test: 0.4218
M2 | Epoch 27/30 | Train: 0.2329 | Test: 0.4421
M2 | Epoch 28/30 | Train: 0.2315 | Test: 0.4189
M2 | Epoch 29/30 | Train: 0.2280 | Test: 0.4545
M2 | Epoch 30/30 | Train: 0.2243 | Test: 0.4574

Show code
print("\n=== M2 final ===")

=== M2 final ===
Show code
print(f"M2 final | MAE: {result_m2['mae']:.3f} | RMSE: {result_m2['rmse']:.3f}")
M2 final | MAE: 4.171 | RMSE: 8.319
Show code
plot_learning_curve(result_m2["history"], "M2")

7.4. M3.1.1 — Late-window transfer - Retain full Hospital list

This model will compare the results of the early window to the results of the full window, using the same hyperparameters.

Show code
config_m3_1_1 = lu_models[lu_models["model_id"] == "M3_1_1"].iloc[0]
result_m3_1_1 = train_model(panel, config_m3_1_1, eligibility_lookup)
M3_1_1 | Epoch 1/30 | Train: 0.4323 | Test: 0.3371
M3_1_1 | Epoch 2/30 | Train: 0.3184 | Test: 0.3282
M3_1_1 | Epoch 3/30 | Train: 0.3184 | Test: 0.3345
M3_1_1 | Epoch 4/30 | Train: 0.3046 | Test: 0.3139
M3_1_1 | Epoch 5/30 | Train: 0.2969 | Test: 0.3193
M3_1_1 | Epoch 6/30 | Train: 0.2912 | Test: 0.3062
M3_1_1 | Epoch 7/30 | Train: 0.2905 | Test: 0.3447
M3_1_1 | Epoch 8/30 | Train: 0.2915 | Test: 0.3074
M3_1_1 | Epoch 9/30 | Train: 0.2865 | Test: 0.3076
M3_1_1 | Epoch 10/30 | Train: 0.2885 | Test: 0.3012
M3_1_1 | Epoch 11/30 | Train: 0.2881 | Test: 0.3042
M3_1_1 | Epoch 12/30 | Train: 0.2881 | Test: 0.3124
M3_1_1 | Epoch 13/30 | Train: 0.2903 | Test: 0.3114
M3_1_1 | Epoch 14/30 | Train: 0.2861 | Test: 0.3017
M3_1_1 | Epoch 15/30 | Train: 0.2837 | Test: 0.3050
M3_1_1 | Epoch 16/30 | Train: 0.2875 | Test: 0.3148
M3_1_1 | Epoch 17/30 | Train: 0.2822 | Test: 0.3070
M3_1_1 | Epoch 18/30 | Train: 0.2846 | Test: 0.3022
M3_1_1 | Epoch 19/30 | Train: 0.2842 | Test: 0.3412
M3_1_1 | Epoch 20/30 | Train: 0.2825 | Test: 0.3027
M3_1_1 | Epoch 21/30 | Train: 0.2831 | Test: 0.3099
M3_1_1 | Epoch 22/30 | Train: 0.2837 | Test: 0.3149
M3_1_1 | Epoch 23/30 | Train: 0.2793 | Test: 0.3057
M3_1_1 | Epoch 24/30 | Train: 0.2808 | Test: 0.3031
M3_1_1 | Epoch 25/30 | Train: 0.2813 | Test: 0.3137
M3_1_1 | Epoch 26/30 | Train: 0.2796 | Test: 0.3144
M3_1_1 | Epoch 27/30 | Train: 0.2755 | Test: 0.3136
M3_1_1 | Epoch 28/30 | Train: 0.2781 | Test: 0.3154
M3_1_1 | Epoch 29/30 | Train: 0.2787 | Test: 0.3131
M3_1_1 | Epoch 30/30 | Train: 0.2786 | Test: 0.3146

Show code
print("\n=== M3.1.1 final ===")

=== M3.1.1 final ===
Show code
print(f"M3.1.1 final | MAE: {result_m3_1_1['mae']:.3f} | RMSE: {result_m3_1_1['rmse']:.3f}")
M3.1.1 final | MAE: 3.021 | RMSE: 6.062
Show code
plot_learning_curve(result_m3_1_1["history"], "M3.1 - All Hospitals")

7.5. M3.1.2 — Late-window transfer - Reduced Hospital List

This model is trained off less data, for the full range. Expect to see loss of predictive value. Hopefully able to introduce more data later.

Show code
config_m3_1_2 = lu_models[lu_models["model_id"] == "M3_1_2"].iloc[0]
result_m3_1_2 = train_model(panel, config_m3_1_2, eligibility_lookup)
M3_1_2 | Epoch 1/30 | Train: 0.5007 | Test: 0.3424
M3_1_2 | Epoch 2/30 | Train: 0.3281 | Test: 0.3321
M3_1_2 | Epoch 3/30 | Train: 0.3157 | Test: 0.3209
M3_1_2 | Epoch 4/30 | Train: 0.3072 | Test: 0.3139
M3_1_2 | Epoch 5/30 | Train: 0.3058 | Test: 0.3115
M3_1_2 | Epoch 6/30 | Train: 0.2996 | Test: 0.3053
M3_1_2 | Epoch 7/30 | Train: 0.2975 | Test: 0.3115
M3_1_2 | Epoch 8/30 | Train: 0.3000 | Test: 0.3049
M3_1_2 | Epoch 9/30 | Train: 0.2975 | Test: 0.3162
M3_1_2 | Epoch 10/30 | Train: 0.2963 | Test: 0.3060
M3_1_2 | Epoch 11/30 | Train: 0.2940 | Test: 0.3095
M3_1_2 | Epoch 12/30 | Train: 0.2951 | Test: 0.3098
M3_1_2 | Epoch 13/30 | Train: 0.2940 | Test: 0.3049
M3_1_2 | Epoch 14/30 | Train: 0.2937 | Test: 0.3012
M3_1_2 | Epoch 15/30 | Train: 0.2942 | Test: 0.3100
M3_1_2 | Epoch 16/30 | Train: 0.2910 | Test: 0.2991
M3_1_2 | Epoch 17/30 | Train: 0.2939 | Test: 0.2990
M3_1_2 | Epoch 18/30 | Train: 0.2891 | Test: 0.2999
M3_1_2 | Epoch 19/30 | Train: 0.2890 | Test: 0.3012
M3_1_2 | Epoch 20/30 | Train: 0.2918 | Test: 0.3024
M3_1_2 | Epoch 21/30 | Train: 0.2901 | Test: 0.3017
M3_1_2 | Epoch 22/30 | Train: 0.2875 | Test: 0.3066
M3_1_2 | Epoch 23/30 | Train: 0.2879 | Test: 0.3037
M3_1_2 | Epoch 24/30 | Train: 0.2871 | Test: 0.3084
M3_1_2 | Epoch 25/30 | Train: 0.2863 | Test: 0.3089
M3_1_2 | Epoch 26/30 | Train: 0.2847 | Test: 0.3111
M3_1_2 | Epoch 27/30 | Train: 0.2821 | Test: 0.3106
M3_1_2 | Epoch 28/30 | Train: 0.2815 | Test: 0.3158
M3_1_2 | Epoch 29/30 | Train: 0.2807 | Test: 0.3125
M3_1_2 | Epoch 30/30 | Train: 0.2796 | Test: 0.3228

Show code
print("\n=== M3.1.2 final ===")

=== M3.1.2 final ===
Show code
print(f"M3.1.2 final | MAE: {result_m3_1_2['mae']:.3f} | RMSE: {result_m3_1_2['rmse']:.3f}")
M3.1.2 final | MAE: 2.496 | RMSE: 5.132
Show code
plot_learning_curve(result_m3_1_2["history"], "M3.1 - Limited Hospital List")

7.6. M3.2 — Late-window data - reduced hospital list + Wastewater Signal

Show code
config_m3_2 = lu_models[lu_models["model_id"] == "M3_2"].iloc[0]
result_m3_2 = train_model(panel, config_m3_2, eligibility_lookup)
M3_2 | Epoch 1/30 | Train: 0.4871 | Test: 0.3684
M3_2 | Epoch 2/30 | Train: 0.3318 | Test: 0.3505
M3_2 | Epoch 3/30 | Train: 0.3230 | Test: 0.3413
M3_2 | Epoch 4/30 | Train: 0.3139 | Test: 0.3379
M3_2 | Epoch 5/30 | Train: 0.3060 | Test: 0.3189
M3_2 | Epoch 6/30 | Train: 0.2968 | Test: 0.3076
M3_2 | Epoch 7/30 | Train: 0.2948 | Test: 0.3106
M3_2 | Epoch 8/30 | Train: 0.2920 | Test: 0.3013
M3_2 | Epoch 9/30 | Train: 0.2862 | Test: 0.3045
M3_2 | Epoch 10/30 | Train: 0.2886 | Test: 0.2970
M3_2 | Epoch 11/30 | Train: 0.2855 | Test: 0.3157
M3_2 | Epoch 12/30 | Train: 0.2841 | Test: 0.2957
M3_2 | Epoch 13/30 | Train: 0.2843 | Test: 0.3113
M3_2 | Epoch 14/30 | Train: 0.2847 | Test: 0.3116
M3_2 | Epoch 15/30 | Train: 0.2859 | Test: 0.2968
M3_2 | Epoch 16/30 | Train: 0.2834 | Test: 0.3089
M3_2 | Epoch 17/30 | Train: 0.2796 | Test: 0.3049
M3_2 | Epoch 18/30 | Train: 0.2791 | Test: 0.2999
M3_2 | Epoch 19/30 | Train: 0.2813 | Test: 0.2997
M3_2 | Epoch 20/30 | Train: 0.2805 | Test: 0.3172
M3_2 | Epoch 21/30 | Train: 0.2810 | Test: 0.2976
M3_2 | Epoch 22/30 | Train: 0.2820 | Test: 0.3060
M3_2 | Epoch 23/30 | Train: 0.2756 | Test: 0.3012
M3_2 | Epoch 24/30 | Train: 0.2738 | Test: 0.3113
M3_2 | Epoch 25/30 | Train: 0.2763 | Test: 0.3222
M3_2 | Epoch 26/30 | Train: 0.2745 | Test: 0.3005
M3_2 | Epoch 27/30 | Train: 0.2808 | Test: 0.3075
M3_2 | Epoch 28/30 | Train: 0.2788 | Test: 0.3170
M3_2 | Epoch 29/30 | Train: 0.2700 | Test: 0.3057
M3_2 | Epoch 30/30 | Train: 0.2681 | Test: 0.3168

Show code
print("\n=== M3.2 final ===")

=== M3.2 final ===
Show code
print(f"M3.2 final | MAE: {result_m3_2['mae']:.3f} | RMSE: {result_m3_2['rmse']:.3f}")
M3.2 final | MAE: 2.500 | RMSE: 5.148
Show code
plot_learning_curve(result_m3_2["history"], "M3.2 - Hospitals with WW Data")

7.7. Export Model Results (by SW) to Outputs Folder

Show code

def per_sw_mae(result):
    int_to_sw_id = {v: k for k, v in result["sw_id_to_int"].items()}
    df = pd.DataFrame({
        "sw_idx": result["sw_idx"],
        "abs_err": np.abs(result["preds"] - result["targets"]),
        "actual": result["targets"],
    })
    out = (df.groupby("sw_idx")
           .agg(mae=("abs_err", "mean"),
                n_test_weeks=("abs_err", "size"),
                mean_actual=("actual", "mean"))
           .reset_index())
    out["sw_id"] = out["sw_idx"].map(int_to_sw_id)
    return out[["sw_id", "mae", "n_test_weeks", "mean_actual"]]


per_sw_mae(result_m1).to_csv("data/outputs/per_sw_mae_m1.csv", index=False)
per_sw_mae(result_m2).to_csv("data/outputs/per_sw_mae_m2.csv", index=False)
per_sw_mae(result_m3_1_1).to_csv("data/outputs/per_sw_mae_m3_1_1.csv", index=False)
per_sw_mae(result_m3_1_2).to_csv("data/outputs/per_sw_mae_m3_1_2.csv", index=False)
per_sw_mae(result_m3_2  ).to_csv("data/outputs/per_sw_mae_m3_2.csv",   index=False)

7.8. Input Back into R

Show code
results <- list(
  m1     = read.csv("data/outputs/per_sw_mae_m1.csv"),
  m2     = read.csv("data/outputs/per_sw_mae_m2.csv"),
  m3_1_1 = read.csv("data/outputs/per_sw_mae_m3_1_1.csv"),
  m3_1_2 = read.csv("data/outputs/per_sw_mae_m3_1_2.csv"),
  m3_2   = read.csv("data/outputs/per_sw_mae_m3_2.csv")
)

Part 8. Evaluation

8.1. Final Model Comparison

Show code
comparison = pd.DataFrame([
    {"Model": "M1",     "Window": "2020-09 → 2021-09", "Sewersheds": 108,
     "Features": "hosp-only",         "MAE": result_m1["mae"],     "RMSE": result_m1["rmse"]},
    {"Model": "M2",     "Window": "2020-09 → 2021-09", "Sewersheds": 108,
     "Features": "hosp-only (tuned)", "MAE": result_m2["mae"],     "RMSE": result_m2["rmse"]},
    {"Model": "M3.1.1", "Window": "2023-01 → 2024-12", "Sewersheds": 108,
     "Features": "hosp-only",         "MAE": result_m3_1_1["mae"], "RMSE": result_m3_1_1["rmse"]},
    {"Model": "M3.1.2", "Window": "2023-01 → 2024-12", "Sewersheds": 69,
     "Features": "hosp-only",         "MAE": result_m3_1_2["mae"], "RMSE": result_m3_1_2["rmse"]},
    {"Model": "M3.2",   "Window": "2023-01 → 2024-12", "Sewersheds": 69,
     "Features": "hosp + WW",         "MAE": result_m3_2["mae"],   "RMSE": result_m3_2["rmse"]},
])

comparison["MAE"]  = comparison["MAE"].round(3)
comparison["RMSE"] = comparison["RMSE"].round(3)

scrollable(comparison)
Model Window Sewersheds Features MAE RMSE
M1 2020-09 → 2021-09 108 hosp-only 4.700 9.945
M2 2020-09 → 2021-09 108 hosp-only (tuned) 4.171 8.319
M3.1.1 2023-01 → 2024-12 108 hosp-only 3.021 6.062
M3.1.2 2023-01 → 2024-12 69 hosp-only 2.496 5.132
M3.2 2023-01 → 2024-12 69 hosp + WW 2.500 5.148

Research Question 1.

1a. M1, the untuned model, was set with a sequence window of 12 weeks, a learning rate of 0.001, and a hidden size of 32. M2, the sweep winner, used a sequence length of 8, a learning rate of 0.01, and a hidden size of 32 — same architecture, just smaller window and faster learning rate. M1’s typical prediction was off by ~4.7 admissions; M2 reduced that to ~4.2.

The hyperparameter sweep did not produce a singular obvious winner — MAEs across configurations clustered closely, and the ranking was sensitive to seed-to-seed variation (seed 508 vs seed 408 produced different sweep winners). However, the 12-week sequence configurations were consistently among the worst-performing across both seeds, likely because every additional week of lookback also reduces the number of training samples available — costly when the early-window dataset already has only ~52 weeks per sewershed.

1b.

The greatest model improvement was seen comparing M2 to M3.1.1 — both use the same 108-sewershed eligibility set and the same hyperparameters, but M3.1.1 trains on roughly twice as many weeks of data (the late-window 2023-2024 period vs M2’s early-window 2020-2021). M3.1.1’s average prediction error was ~3 admissions, compared to M2’s ~4.2.

Research Question 2.

The addition of wastewater surveillance data in M3.2 does not appear to have improved predictive power compared to M3.1.2. Trained on the same sewersheds, with the same hyperparameters, in the same date range, both models produced nearly identical MAEs (~2.5 admissions/week). Seed-to-seed comparison (seed 508 vs 408) reinforces this null finding: under the alternate seed, M3.2 was nominally better than M3.1.2 by a comparably small margin in the opposite direction. The difference between including and excluding wastewater is well within the noise floor produced by training stochasticity alone.

8.2. Per-sewershed error maps

8.2.1. Comparison Between M3.1.1 (108 sewersheds) vs M3.1.2 (69 sewersheds)

Show code
mae_compare <- bind_rows(
  results$m3_1_1 |> mutate(model = "M3.1.1: 108 sw, hosp-only"),
  results$m3_1_2 |> mutate(model = "M3.1.2: 69 sw, hosp-only")
)

# Cross polygons with both model labels so M3.1.2's missing sewersheds show as grey
map_sewer_compare <- map_lu_sewer |>
  select(sw_id) |>
  tidyr::crossing(model = unique(mae_compare$model)) |>
  left_join(mae_compare, by = c("sw_id", "model")) |>
  st_as_sf()


ggplot(map_sewer_compare) +
  geom_sf(data = map_ny_counties, fill = "white", color = "grey80",
          inherit.aes = FALSE) +
  geom_sf(aes(fill = mae), color = "white", linewidth = 0.1) +
  facet_wrap(~ model, ncol = 2) +
  scale_fill_viridis_c(option = "magma", direction = -1,
                       name = "MAE\n(admissions)",
                       na.value = "grey85") +
  labs(title    = "Per-sewershed forecast error: full vs WW-eligible subset",
       subtitle = "Grey = sewershed excluded from M3.1.2 (no WW coverage ≥ 70%)") +
  theme_void() +
  theme(strip.text = element_text(size = 11))

This map confirms that the 39 sewersheds excluded from the WW-eligible subset cluster in Long Island, parts of NYC, and the Buffalo area. Some of these missing sewersheds appear to have some of the highest error rates in the M3.1.1 model - which aligns with the observation from the full model comparison earlier - some of the sewersheds that were dropped for wastewater surveillance coverage in the analysis are the most difficult sewersheds to predict.

8.2.2. Error Map of Final WWTP Model

Show code
# Per-sewershed delta: positive = WW hurt, negative = WW helped
delta <- results$m3_1_2 |>
  select(sw_id, mae_hosp_only = mae) |>
  inner_join(
    results$m3_2 |> select(sw_id, mae_with_ww = mae),
    by = "sw_id"
  ) |>
  mutate(delta_mae = mae_with_ww - mae_hosp_only)

# Quick sanity print before mapping — useful for the video narration
print(summary(delta$delta_mae))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-5.277017 -0.064053 -0.002640  0.004293  0.058099  5.193978 
Show code
print(paste("WW helped:", sum(delta$delta_mae < 0), "sewersheds"))
[1] "WW helped: 35 sewersheds"
Show code
print(paste("WW hurt:",   sum(delta$delta_mae > 0), "sewersheds"))
[1] "WW hurt: 34 sewersheds"
Show code
# Join to polygons
map_delta <- map_lu_sewer |>
  inner_join(delta, by = "sw_id")

# Diverging scale, centered at zero
delta_max <- max(abs(map_delta$delta_mae), na.rm = TRUE)

ggplot(map_delta) +
  geom_sf(data = map_ny_counties, fill = "white", color = "grey80",
          inherit.aes = FALSE) +
  geom_sf(aes(fill = delta_mae), color = "white", linewidth = 0.1) +
  scale_fill_gradient2(
    low = "#1b7837", mid = "grey95", high = "#762a83",
    midpoint = 0, limits = c(-delta_max, delta_max),
    name = "Δ MAE\n(WW − hosp-only)"
  ) +
  labs(
    title    = "Marginal effect of wastewater signal: M3.2 − M3.1.2",
    subtitle = "Green = WW improved forecast; purple = WW hurt; grey = no change",
    caption  = "Same 69 sewersheds, same hyperparameters, only feature set differs"
  ) +
  theme_void()

Most of New York State sits at or near zero on this map — adding wastewater data shifted per-sewershed forecasts by less than half an admission per week in the vast majority of sewersheds. A handful of sewersheds show larger swings (~5 admissions), with one large positive shift near NYC (WW worsened the forecast there) and one large negative shift in central NY (WW improved it).

However, comparing this map to the equivalent map produced under a different seed (seed 408) reveals that these per-sewershed shifts are not stable: individual sewersheds change color and magnitude substantially between runs, while the pooled MAE remains essentially unchanged. The cleanest interpretation of these changing results is that the per-sewershed effects reflect optimization variability rather than a stable wastewater effect — adding WW surveillance does not consistently help or hurt forecasts at any specific location.

Combined with the near-zero pooled difference in 8.1, the spatial story confirms the headline finding: wastewater surveillance data does not measurably improve sewershed-level COVID-19 hospitalization forecasting beyond what hospital occupancy features already provide in these models.