Preamble

This notebook performs an end-to-end wastewater-based epidemiology (WBE) analysis of SARS-CoV-2 surveillance data from the Point Loma WWTP sewershed (San Diego, CA), covering the five-month window October 1, 2021 – February 28, 2022. This window spans the tail end of the Delta wave and the full emergence of Omicron — a period analyzed in Karthikeyan et al., 2022 (Nature), which demonstrated that wastewater sequencing detected variant emergence weeks before clinical confirmation.

The analysis follows seven sequential steps mirroring operational biosurveillance workflows: Sample QC → Concentration transformation → Baseline & activity-level classification → Trend detection → Variant deconvolution → Emerging lineage detection → Clinical integration.

Package dependencies

# Install any missing packages automatically
pkgs <- c(
  "tidyverse", "lubridate", "zoo", "slider", "TTR",
  "changepoint", "trend", "broom", "knitr", "kableExtra",
  "scales", "patchwork", "ggrepel", "viridis", "RColorBrewer"
)
to_install <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")

library(tidyverse)
library(lubridate)
library(zoo)
library(slider)
library(TTR)        # EMA / SMA helpers
library(changepoint)
library(trend)      # Mann-Kendall
library(broom)
library(knitr)
library(kableExtra)
library(scales)
library(patchwork)
library(ggrepel)
library(viridis)
library(RColorBrewer)
library(readr)

theme_set(theme_bw(base_size = 12))

Data loading

Note on data access: The dataset is distributed via the Latch Bio console (https://console.latch.bio/s/20805773099779822). Download the bundle sars_cov2_wastewater_takehome/ and place the three files — qpcr.csv, freyja.tsv, and cases.csv — in your working directory before knitting.

getwd()

## [1] "C:/Users/Shahab/Desktop/Homework"

setwd("C:/Users/Shahab/Desktop/Homework")
# ── Load raw data ─────────────────────────────────────────────────────────────
qpcr   <- read_csv("qpcr.csv",   show_col_types = FALSE)
freyja <- read_tsv("freyja.tsv", show_col_types = FALSE)
cases  <- read_csv("cases.csv",  show_col_types = FALSE)

# ── Harmonise date columns ────────────────────────────────────────────────────
qpcr   <- qpcr   %>% mutate(date = as_date(date))
freyja <- freyja %>% mutate(date = as_date(date))
cases  <- cases  %>% mutate(week_ending = as_date(week_ending))

# ── Quick look ────────────────────────────────────────────────────────────────
glimpse(qpcr)

## Rows: 60
## Columns: 9
## $ date                <date> 2021-10-04, 2021-10-06, 2021-10-10, 2021-10-11, 2…
## $ site                <chr> "PointLoma", "PointLoma", "PointLoma", "PointLoma"…
## $ pcr_target_avg_conc <dbl> 248673, 218099, 248399, 409576, 653372, 703496, 52…
## $ pmmov_copies_per_L  <dbl> 31884542, 40272628, 45881970, 53495335, 35787131, …
## $ flow_mgd            <dbl> 166.08, 154.12, 131.01, 160.94, 146.79, 130.32, 14…
## $ detection_flag      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ lab_method          <chr> "ddPCR_N1N2_PEG_concentrated", "ddPCR_N1N2_PEG_con…
## $ replicate_cv        <dbl> 0.316, 0.174, 0.311, 0.177, 0.104, 0.191, 0.259, 0…
## $ population_served   <dbl> 2300000, 2300000, 2300000, 2300000, 2300000, 23000…

glimpse(freyja)

## Rows: 40
## Columns: 59
## $ date                       <date> 2021-10-04, 2021-10-06, 2021-10-10, 2021-1…
## $ site                       <chr> "PointLoma", "PointLoma", "PointLoma", "Poi…
## $ BA.1                       <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0…
## $ BA.1.1.X                   <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0…
## $ BA.2.X                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ BA.2.12.X                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ BA.4.X                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ BA.5.X                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ B.1.1.529                  <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0…
## $ AY.113                     <dbl> 0.018, 0.003, 0.014, 0.017, 0.094, 0.175, 0…
## $ AY.100                     <dbl> 0.080, 0.000, 0.288, 0.053, 0.222, 0.107, 0…
## $ AY.20                      <dbl> 0.000, 0.011, 0.077, 0.042, 0.051, 0.061, 0…
## $ AY.25                      <dbl> 0.371, 0.437, 0.111, 0.011, 0.211, 0.111, 0…
## $ AY.3                       <dbl> 0.019, 0.000, 0.012, 0.000, 0.059, 0.107, 0…
## $ AY.44                      <dbl> 0.131, 0.138, 0.333, 0.406, 0.121, 0.107, 0…
## $ AY.119                     <dbl> 0.011, 0.000, 0.000, 0.013, 0.013, 0.025, 0…
## $ AY.3.1                     <dbl> 0.007, 0.000, 0.000, 0.001, 0.000, 0.050, 0…
## $ AY.103                     <dbl> 0.345, 0.410, 0.165, 0.449, 0.167, 0.143, 0…
## $ AY.46.4                    <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0…
## $ AY.25.1                    <dbl> 0.000, 0.000, 0.000, 0.000, 0.018, 0.026, 0…
## $ AY.116                     <dbl> 0.000, 0.000, 0.000, 0.000, 0.041, 0.044, 0…
## $ AY.43.4                    <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0…
## $ `Other Delta sub-lineages` <dbl> 0.017, 0.000, 0.000, 0.009, 0.006, 0.045, 0…
## $ BA.2.75                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ BA.4.6                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ BQ.1.X                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ BQ.1.1.X                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ BF.7.X                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ BN.1.X                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ XBB.X                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ XBB.1.5.X                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ XBB.1.9.X                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ XBB.1.16.X                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ XBB.2.3.X                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ EG.5.X                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ BA.2.86.X                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ HV.1.X                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ JN.1.X                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ JN.1.7.X                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ JN.1.4.X                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ KQ.1.X                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ JN.1.11.X                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ KP.2.X                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Recombinants               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Other                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ XDP.1                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ JN.1.16.X                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ XDP                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ KP.3.X                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ LB.1.X                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ KP.1.1.X                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ KP.3.1.1.X                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Delta                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ XEC.X                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ LP.8.1.X                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ LP.8.X                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ NB.1.8.1.X                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ XFG.X                      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ residual                   <dbl> 0.001, 0.001, 0.000, 0.000, 0.000, 0.000, 0…

glimpse(cases)

## Rows: 23
## Columns: 4
## $ week_ending     <date> 2021-10-02, 2021-10-09, 2021-10-16, 2021-10-23, 2021-…
## $ county          <chr> "San Diego", "San Diego", "San Diego", "San Diego", "S…
## $ weekly_cases    <dbl> 836, 3250, 3070, 3451, 3381, 3375, 3241, 3386, 3073, 5…
## $ n_days_reported <dbl> 2, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, …

Step 1 — Sample QC

Rationale

Raw wastewater measurements are subject to several sources of technical noise: inhibited PCR reactions, failed extractions, atypical flow events (rain dilution, industrial discharge), and below-LOD samples. Before any epidemiological inference we must identify and flag low-quality samples using orthogonal indicators built into the data:

QC criterion	Variable used	Rationale
Detection flag	`detection_flag`	Lab-reported below-LOD
Replicate CV	`replicate_cv`	High CV (>25–30 %) indicates technical irreproducibility
PMMoV anchor	`pmmov_copies_per_L`	Pepper Mild Mottle Virus is a human-fecal indicator; extreme outliers suggest matrix failure
Flow rate	`flow_mgd`	Extreme flow dilutes viral signal; flag samples outside ±3 SD of site mean

PMMoV is used as a fecal-strength indicator but not as a normalization divisor in the primary WVAL calculation (following the August 2025 CDC update reverting to non-normalized data). It is retained here for QC and as a sensitivity-check covariate.

# ── Flag rules ────────────────────────────────────────────────────────────────
flow_mean <- mean(qpcr$flow_mgd, na.rm = TRUE)
flow_sd   <- sd(qpcr$flow_mgd,   na.rm = TRUE)

pmmov_mean <- mean(log10(qpcr$pmmov_copies_per_L + 1), na.rm = TRUE)
pmmov_sd   <- sd(log10(qpcr$pmmov_copies_per_L + 1),   na.rm = TRUE)

qpcr_qc <- qpcr %>%
  mutate(
    # Individual QC flags (TRUE = problem)
    flag_detection   = detection_flag %in% c("non-detect", "below_LOD", FALSE) |
                       pcr_target_avg_conc <= 0,
    flag_cv          = !is.na(replicate_cv) & replicate_cv > 30,
    flag_flow        = !is.na(flow_mgd) &
                       abs(flow_mgd - flow_mean) > 3 * flow_sd,
    flag_pmmov       = !is.na(pmmov_copies_per_L) &
                       abs(log10(pmmov_copies_per_L + 1) - pmmov_mean) > 3 * pmmov_sd,
    # Composite
    qc_fail = flag_detection | flag_cv | flag_flow | flag_pmmov,
    qc_pass = !qc_fail
  )

# ── QC summary table ──────────────────────────────────────────────────────────
qpcr_qc %>%
  summarise(
    n_total        = n(),
    n_nondetect    = sum(flag_detection,    na.rm = TRUE),
    n_high_cv      = sum(flag_cv,           na.rm = TRUE),
    n_extreme_flow = sum(flag_flow,         na.rm = TRUE),
    n_extreme_pmmov= sum(flag_pmmov,        na.rm = TRUE),
    n_fail_any     = sum(qc_fail,           na.rm = TRUE),
    pct_pass       = round(100 * mean(qc_pass, na.rm = TRUE), 1)
  ) %>%
  kable(caption = "Table 1. QC flag summary") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Table 1. QC flag summary
n_total	n_nondetect	n_high_cv	n_extreme_flow	n_extreme_pmmov	n_fail_any	pct_pass
60	0	0	1	1	2	96.7

p_flow <- ggplot(qpcr_qc, aes(date, flow_mgd, colour = qc_fail)) +
  geom_point(size = 1.8, alpha = 0.8) +
  scale_colour_manual(values = c("FALSE" = "steelblue", "TRUE" = "firebrick"),
                      labels = c("Pass", "Fail")) +
  labs(title = "Flow rate over time", x = NULL, y = "Flow (MGD)", colour = "QC") +
  theme(legend.position = "top")

p_cv <- ggplot(qpcr_qc %>% filter(!is.na(replicate_cv)),
               aes(date, replicate_cv, colour = flag_cv)) +
  geom_point(size = 1.8, alpha = 0.8) +
  geom_hline(yintercept = 30, linetype = "dashed", colour = "firebrick") +
  scale_colour_manual(values = c("FALSE" = "steelblue", "TRUE" = "firebrick"),
                      labels = c("<= 30 %", "> 30 %")) +
  labs(title = "Replicate CV over time",
       x = NULL, y = "Replicate CV (%)", colour = "CV flag") +
  theme(legend.position = "top")

p_pmmov <- ggplot(qpcr_qc, aes(date, log10(pmmov_copies_per_L + 1),
                                colour = flag_pmmov)) +
  geom_point(size = 1.8, alpha = 0.8) +
  scale_colour_manual(values = c("FALSE" = "steelblue", "TRUE" = "firebrick"),
                      labels = c("Normal", "Outlier")) +
  labs(title = "PMMoV (log₁₀) over time",
       x = NULL, y = "log₁₀ PMMoV (copies/L)", colour = "PMMoV flag") +
  theme(legend.position = "top")

p_flow / p_cv / p_pmmov

# Working dataset: QC-pass only, positive concentrations
qpcr_clean <- qpcr_qc %>%
  filter(qc_pass, pcr_target_avg_conc > 0) %>%
  arrange(date)

cat("Samples retained after QC:", nrow(qpcr_clean), "/", nrow(qpcr_qc), "\n")

## Samples retained after QC: 58 / 60

Step 1 — MCQ

After applying multi-criterion QC (non-detect flag, replicate CV > 30 %, ±3 SD flow outlier, ±3 SD PMMoV outlier), which statement best describes the overall data quality and the principal reason samples failed QC at Point Loma during this surveillance window?

A) The majority of samples failed QC, with high replicate CV being the dominant failure mode — consistent with sustained PCR inhibition in this matrix.

B) A small minority of samples failed QC; the dominant failure mode was extreme flow-rate events (rain dilution or industrial surge), not PCR irreproducibility.

C) Failure rates were moderate (~20–30 %); non-detect flags and extreme PMMoV outliers contributed equally, suggesting intermittent extraction failure rather than any single process breakdown.

D) Nearly all samples passed QC; the few failures were driven primarily by below-LOD non-detects at the very beginning of the window before viral concentrations rose to reliably detectable levels.

Step 2 — Concentration Transformation

Rationale

Raw ddPCR viral RNA concentrations in wastewater span several orders of magnitude (~10²–10⁶ copies/L) and are right-skewed. For downstream regression, baseline estimation, and epidemiological trend detection, a log₁₀ transformation is standard in the WBE literature (Karthikeyan et al. 2022; CDC NWSS). This:

Stabilises variance across the dynamic range
Linearises exponential growth/decay into slopes interpretable as doubling/halving times
Makes the distribution approximately normal, validating parametric methods

We also compute a 7-day centred rolling median (robust to single outliers) for trend visualisation, consistent with the rolling-average approach used in the Karthikeyan et al. paper.

qpcr_t <- qpcr_clean %>%
  mutate(
    log10_conc = log10(pcr_target_avg_conc),
    # 7-day centred rolling median (robust smoother)
    roll7_median = slide_dbl(log10_conc, median, na.rm = TRUE,
                             .before = 3, .after = 3),
    # 7-day rolling mean (for WVAL baseline later)
    roll7_mean   = slide_dbl(log10_conc, mean, na.rm = TRUE,
                             .before = 3, .after = 3)
  )

ggplot(qpcr_t, aes(date)) +
  geom_point(aes(y = log10_conc), colour = "grey60", size = 1.4, alpha = 0.7) +
  geom_line(aes(y = roll7_median), colour = "steelblue", linewidth = 1.1) +
  scale_x_date(date_breaks = "2 weeks", date_labels = "%b %d") +
  labs(
    title    = "SARS-CoV-2 RNA concentration over time (log₁₀-transformed)",
    subtitle = "Points = daily ddPCR; line = 7-day centred rolling median",
    x = NULL, y = "log₁₀ copies / L"
  ) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

# Assess normality before vs after transform
p_raw <- ggplot(qpcr_clean, aes(pcr_target_avg_conc)) +
  geom_histogram(bins = 30, fill = "steelblue", colour = "white") +
  labs(title = "Raw concentration", x = "copies/L", y = "Count")

p_log <- ggplot(qpcr_t, aes(log10_conc)) +
  geom_histogram(bins = 30, fill = "darkorange", colour = "white") +
  stat_function(fun = function(x)
    dnorm(x, mean(qpcr_t$log10_conc), sd(qpcr_t$log10_conc)) * nrow(qpcr_t) *
      diff(range(qpcr_t$log10_conc)) / 30,
    colour = "black", linewidth = 1) +
  labs(title = "log₁₀ concentration", x = "log₁₀ copies/L", y = "Count")

p_raw + p_log

Step 2 — MCQ

After applying log₁₀ transformation to the raw ddPCR concentration data from Point Loma, what does the shape of the log-transformed distribution and the trajectory of the 7-day rolling median together suggest about SARS-CoV-2 circulation dynamics during this five-month window?

A) The log-transformed values are uniformly distributed and the rolling median is flat, indicating stable endemic transmission with no significant waves during October 2021 – February 2022.

B) The distribution is bimodal and the rolling median shows two distinct elevated plateaus of equal magnitude, consistent with two independent epidemic waves of similar peak intensity.

C) The log-transformed values approach normality and the rolling median reveals at least one major ascending phase reaching a clear peak, followed by evidence of a secondary rise — consistent with sequential variant waves (Delta tail → Omicron emergence) rather than stable endemic transmission.

D) The transformation yields a strongly left-skewed distribution, indicating that most samples captured the peak of a single dominant wave with rapid post-peak decline.

Step 3 — Baseline & Activity Level Classification (CDC WVAL)

Rationale

The CDC Wastewater Viral Activity Level (WVAL) standardises site-level measurements against a historical baseline so that activity levels can be compared across sites and time. The core calculation is:

\[\text{WVAL} = \frac{C_{\text{recent}} - \mu_{\text{baseline}}}{\sigma_{\text{baseline}}}\]

where \(C_{\text{recent}}\) is a smoothed (7-day median) log₁₀ concentration, and \(\mu\) and \(\sigma\) are the mean and standard deviation computed over a baseline window of low-activity periods.

For a 5-month window, we define the baseline as the first 8 weeks of data (the minimum CDC requires before publishing a WVAL), which in this dataset corresponds to October–November 2021 — a period that preceded the Omicron surge and therefore represents a relatively low baseline consistent with Delta wave dynamics. The WVAL thresholds used (Minimal/Low/Moderate/High/Very High) are then derived from the baseline distribution.

# ── Baseline window: first 8 weeks ───────────────────────────────────────────
baseline_end <- min(qpcr_t$date) + weeks(8)

baseline_stats <- qpcr_t %>%
  filter(date <= baseline_end) %>%
  summarise(
    bl_mean = mean(log10_conc, na.rm = TRUE),
    bl_sd   = sd(log10_conc,   na.rm = TRUE)
  )

cat("Baseline mean (log₁₀):", round(baseline_stats$bl_mean, 3),
    "\nBaseline SD  (log₁₀):", round(baseline_stats$bl_sd,   3), "\n")

## Baseline mean (log₁₀): 5.876 
## Baseline SD  (log₁₀): 0.284

# ── WVAL score ────────────────────────────────────────────────────────────────
qpcr_wval <- qpcr_t %>%
  mutate(
    wval_score = (roll7_median - baseline_stats$bl_mean) / baseline_stats$bl_sd
  )

# ── Classify activity levels (CDC-inspired thresholds in standardised units) ──
# CDC's thresholds map approximately to these z-score ranges (2024 framework):
# Minimal : wval_score < 0
# Low     : 0   <= wval_score < 1
# Moderate: 1   <= wval_score < 2
# High    : 2   <= wval_score < 3
# Very High: wval_score >= 3

wval_levels <- c("Minimal", "Low", "Moderate", "High", "Very High")
wval_colours <- c("Minimal"   = "#2166AC",
                  "Low"       = "#74ADD1",
                  "Moderate"  = "#FEE090",
                  "High"      = "#F46D43",
                  "Very High" = "#A50026")

qpcr_wval <- qpcr_wval %>%
  mutate(
    activity_level = case_when(
      wval_score <  0 ~ "Minimal",
      wval_score <  1 ~ "Low",
      wval_score <  2 ~ "Moderate",
      wval_score <  3 ~ "High",
      TRUE            ~ "Very High"
    ),
    activity_level = factor(activity_level, levels = wval_levels)
  )

# Ribbon background by activity level
# Build date ranges for each level for background shading
activity_runs <- qpcr_wval %>%
  select(date, activity_level) %>%
  mutate(run_id = cumsum(activity_level != lag(activity_level,
                                               default = first(activity_level))))

activity_ranges <- activity_runs %>%
  group_by(run_id, activity_level) %>%
  summarise(start = min(date), end = max(date), .groups = "drop")

p_wval <- ggplot(qpcr_wval, aes(date)) +
  # Background shading per activity level
  geom_rect(data = activity_ranges,
            aes(xmin = start - 0.5, xmax = end + 0.5,
                ymin = -Inf, ymax = Inf, fill = activity_level),
            alpha = 0.25, inherit.aes = FALSE) +
  # Smoothed signal
  geom_line(aes(y = wval_score), colour = "black", linewidth = 0.9) +
  geom_point(aes(y = wval_score, colour = activity_level),
             size = 1.5, alpha = 0.7) +
  # Threshold lines
  geom_hline(yintercept = c(0, 1, 2, 3), linetype = "dashed",
             colour = "grey40", linewidth = 0.4) +
  annotate("text", x = max(qpcr_wval$date), y = c(-0.2, 0.5, 1.5, 2.5, 3.5),
           label = wval_levels, hjust = 1, size = 3, colour = "grey30") +
  scale_fill_manual(values   = wval_colours, guide = "none") +
  scale_colour_manual(values = wval_colours, name = "Activity level") +
  scale_x_date(date_breaks = "2 weeks", date_labels = "%b %d") +
  labs(
    title    = "CDC WVAL — Point Loma WWTP",
    subtitle = "Baseline: first 8 weeks (Oct–Nov 2021); z-score relative to baseline",
    x = NULL, y = "WVAL score (z-score)"
  ) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        legend.position = "top")

p_wval

qpcr_wval %>%
  count(activity_level) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  kable(caption = "Table 2. Days per CDC activity level") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Table 2. Days per CDC activity level
activity_level	n	pct
Minimal	9	15.5
Low	11	19.0
Moderate	5	8.6
High	5	8.6
Very High	28	48.3

Step 3 — MCQ

After applying CDC WVAL methodology with a baseline window anchored to the first 8 weeks of the surveillance record (Oct–Nov 2021), which statement best characterises the activity-level trajectory at Point Loma WWTP across the full five-month window?

A) Sustained Very Low / Minimal activity throughout — no period exceeded the Low threshold, suggesting SARS-CoV-2 transmission was controlled for the entire window.

B) A single transient spike to High lasting fewer than two weeks, otherwise Low / Moderate — consistent with a brief outbreak that was rapidly contained without a sustained wave.

C) A multi-week ramp from Minimal / Low through Moderate to a Very High peak, with the escalation concentrated in December 2021 – January 2022, followed by a declining trajectory toward High / Moderate by late February — consistent with the Omicron BA.1 wave superimposed on residual Delta signal.

D) Bimodal — two distinct Very High waves separated by a return to Minimal, reflecting two fully independent epidemic introductions each starting from near-zero baseline.

Step 4 — Trend Detection

Rationale

Trend detection serves two goals: (1) confirming the directionality of activity-level transitions detected in Step 3, and (2) providing an analytically robust lead-time estimate relative to clinical surveillance. We use two complementary methods:

Method	What it tests
Mann-Kendall trend test	Non-parametric test for monotonic trend in the full series; does not assume normality or linearity
Changepoint detection (PELT algorithm)	Identifies structural breakpoints in the mean and/or variance of the series — marking when the trend direction changed

The PELT (Pruned Exact Linear Time) algorithm from the changepoint package detects changepoints in the mean of the log₁₀ concentration series under a penalty chosen by BIC, which balances sensitivity to real breaks against false positives.

# ── Mann-Kendall over the full QC-pass series ─────────────────────────────────
mk_result <- mk.test(qpcr_wval$log10_conc)
print(mk_result)

## 
##  Mann-Kendall trend test
## 
## data:  qpcr_wval$log10_conc
## z = 6.527, n = 58, p-value = 6.709e-11
## alternative hypothesis: true S is not equal to 0
## sample estimates:
##           S        varS         tau 
##   974.00000 22222.66667     0.58941

# ── Mann-Kendall in first half (Oct–Dec) vs second half (Jan–Feb) ─────────────
mk_first  <- mk.test(filter(qpcr_wval, date < as_date("2022-01-01"))$log10_conc)
mk_second <- mk.test(filter(qpcr_wval, date >= as_date("2022-01-01"))$log10_conc)

tibble(
  Period    = c("Full window (Oct–Feb)", "Oct–Dec 2021", "Jan–Feb 2022"),
  tau       = c(mk_result$statistic, mk_first$statistic, mk_second$statistic),
  p_value   = c(mk_result$p.value,   mk_first$p.value,   mk_second$p.value),
  direction = ifelse(tau > 0, "Increasing", "Decreasing")
) %>%
  mutate(across(where(is.numeric), ~round(.x, 4))) %>%
  kable(caption = "Table 3. Mann-Kendall trend tests") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Table 3. Mann-Kendall trend tests
Period	tau	direction
Full window (Oct–Feb)	6.5270	Increasing
Oct–Dec 2021	6.8316	Increasing
Jan–Feb 2022	-5.9159	Decreasing

# ── PELT changepoint detection on the smoothed log₁₀ series ──────────────────
cpt_series <- qpcr_wval$roll7_median
cpt_result <- cpt.mean(cpt_series, method = "PELT", penalty = "BIC",
                       minseglen = 7)  # minimum 7 days per segment

# Extract changepoint indices and map to dates
cpt_indices <- cpts(cpt_result)
cpt_dates   <- qpcr_wval$date[cpt_indices]

cat("Changepoint dates detected:\n")

## Changepoint dates detected:

print(cpt_dates)

## [1] "2021-12-13"

# Segment means
seg_means <- param.est(cpt_result)$mean

# Vertical lines at changepoints + segment means overlaid
cpt_df <- tibble(date = cpt_dates)

ggplot(qpcr_wval, aes(date)) +
  geom_point(aes(y = log10_conc), colour = "grey70", size = 1.2, alpha = 0.6) +
  geom_line(aes(y = roll7_median), colour = "steelblue", linewidth = 1) +
  geom_vline(data = cpt_df, aes(xintercept = date),
             linetype = "dashed", colour = "firebrick", linewidth = 0.8) +
  geom_label_repel(data = cpt_df,
                   aes(x = date, y = max(qpcr_wval$log10_conc, na.rm = TRUE) - 0.3,
                       label = format(date, "%b %d")),
                   colour = "firebrick", size = 3.2) +
  scale_x_date(date_breaks = "2 weeks", date_labels = "%b %d") +
  labs(
    title    = "Changepoint detection (PELT-BIC) on 7-day rolling median log₁₀ concentration",
    subtitle = "Red dashed lines = structural changepoints; blue = smoothed signal",
    x = NULL, y = "log₁₀ copies / L"
  ) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Step 4 — MCQ

Applying Mann-Kendall testing and PELT changepoint detection to the log₁₀-transformed concentration series from Point Loma, which description best characterises the directionality and structural transitions of the wastewater signal?

A) A significant positive monotonic trend across the entire five-month window with no structural breakpoints — indicating uninterrupted exponential growth throughout.

B) No statistically significant trend in any sub-period; the detected changepoints reflect only technical noise, not genuine epidemiological transitions.

C) An initial non-significant or weakly negative trend (Oct–Dec) followed by a significant positive trend (Dec–Jan) and a subsequent structural break marking a declining phase (Jan–Feb) — changepoints correspond to the onset and peak of the Omicron BA.1 surge.

D) A significant negative monotonic trend across the full window driven by consistent decline throughout; changepoints are absent because the wave had already peaked before October 2021.

Step 5 — Variant Deconvolution & Lineage Dynamics

Rationale

Freyja uses a lineage-weighted least-squares approach to deconvolute the mixed amplicon sequencing signal in wastewater into relative abundances of SARS-CoV-2 Pango lineages. The residual column measures the fraction of reads unexplained by the current reference set — high residuals warrant caution.

For this analysis we:

Filter timepoints with high residuals (> 0.1) as low-confidence deconvolutions
Collapse the 50+ lineage columns into WHO variant groups (Delta, Omicron BA.1, Omicron BA.2, Other) for clarity
Generate a stacked-area plot of daily lineage dynamics

Lineage-to-variant mapping (based on Pango nomenclature for this period):

Delta: AY.*, B.1.617.2
Omicron BA.1: BA.1, BA.1.*, B.1.1.529
Omicron BA.2: BA.2, BA.2.*
Other / Unclassified: everything else

# ── Identify lineage abundance columns (not date/site/residual) ───────────────
meta_cols <- c("date", "site", "residual")
lineage_cols <- setdiff(names(freyja), meta_cols)

# ── Long format ───────────────────────────────────────────────────────────────
freyja_long <- freyja %>%
  filter(!is.na(residual), residual <= 0.1) %>%   # quality filter
  pivot_longer(cols = all_of(lineage_cols),
               names_to  = "lineage",
               values_to = "abundance") %>%
  filter(!is.na(abundance), abundance > 0)

# ── Assign variant groups ─────────────────────────────────────────────────────
freyja_long <- freyja_long %>%
  mutate(
    variant_group = case_when(
      str_detect(lineage, "^AY\\.|^B\\.1\\.617\\.2") ~ "Delta",
      str_detect(lineage, "^BA\\.1|^B\\.1\\.1\\.529")~ "Omicron BA.1",
      str_detect(lineage, "^BA\\.2")                 ~ "Omicron BA.2",
      TRUE                                            ~ "Other"
    )
  )

# ── Aggregate to variant-group level per day ──────────────────────────────────
freyja_variant <- freyja_long %>%
  group_by(date, variant_group) %>%
  summarise(abundance = sum(abundance, na.rm = TRUE), .groups = "drop") %>%
  # Renormalise per day to sum to 1
  group_by(date) %>%
  mutate(abundance = abundance / sum(abundance)) %>%
  ungroup()

variant_colours <- c(
  "Delta"        = "#E41A1C",
  "Omicron BA.1" = "#377EB8",
  "Omicron BA.2" = "#4DAF4A",
  "Other"        = "#999999"
)

ggplot(freyja_variant,
       aes(date, abundance, fill = variant_group)) +
  geom_area(alpha = 0.85, colour = "white", linewidth = 0.2, position = "stack") +
  scale_fill_manual(values = variant_colours, name = "Variant") +
  scale_x_date(date_breaks = "2 weeks", date_labels = "%b %d") +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title    = "SARS-CoV-2 variant composition in wastewater (Freyja deconvolution)",
    subtitle = "Point Loma WWTP | residual-filtered (≤ 0.10)",
    x = NULL, y = "Relative abundance"
  ) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        legend.position = "top")

# Identify when Omicron BA.1 first exceeded Delta
crossover <- freyja_variant %>%
  pivot_wider(names_from = variant_group, values_from = abundance,
              values_fill = 0) %>%
  filter(`Omicron BA.1` > Delta) %>%
  slice_min(date, n = 1)

cat("Omicron BA.1 first exceeded Delta in wastewater on:",
    format(crossover$date, "%B %d, %Y"), "\n")

## Omicron BA.1 first exceeded Delta in wastewater on: December 12, 2021

Step 5 — MCQ

Based on Freyja deconvolution of the wastewater metagenome at Point Loma, which statement best describes the variant succession dynamics during October 2021 – February 2022?

A) Delta maintained dominance (> 50 % abundance) through the end of February 2022; Omicron was detected only as sporadic low-level contamination below 10 % at any point.

B) Omicron BA.1 and BA.2 emerged simultaneously in early December and co-dominated the community from that point forward, with Delta absent by mid-December.

C) Delta dominated from October through mid-to-late November; Omicron BA.1 then rose rapidly from a cryptic minority signal to community dominance by December–January, followed by early evidence of BA.2 expansion — detectable in wastewater ahead of clinical confirmation.

D) The deconvolution residuals were uniformly high (> 0.3) throughout, rendering all lineage assignments unreliable; no meaningful variant dynamics can be inferred from this dataset.

Step 6 — Emerging / Novel Lineage Detection

Rationale

A core biosurveillance function beyond tracking known variants is early detection of novel or rising lineages before they reach dominance. We operationalise this as:

Residual tracking: a rising Freyja residual can indicate reads from lineages not in the current reference set — the first genomic signal of a truly novel variant
Low-abundance lineage monitoring: track lineages that are present at < 5 % abundance but show a statistically significant positive trend using Mann-Kendall — these are candidates for emerging variants before they cross a detection threshold
First-detection dating: for each major variant group, identify the first date it was detectable above a 1 % abundance threshold in wastewater

freyja_residual <- freyja %>%
  filter(!is.na(residual)) %>%
  select(date, residual) %>%
  arrange(date)

# Mann-Kendall on residual series
mk_residual <- mk.test(freyja_residual$residual)
cat("Mann-Kendall on Freyja residual — tau:", round(mk_residual$statistic, 3),
    "  p =", round(mk_residual$p.value, 4), "\n")

## Mann-Kendall on Freyja residual — tau: 1.239   p = 0.2155

ggplot(freyja_residual, aes(date, residual)) +
  geom_point(colour = "darkorange", size = 1.8, alpha = 0.7) +
  geom_smooth(method = "loess", span = 0.3, colour = "black", linewidth = 0.9,
              se = TRUE, fill = "grey80") +
  geom_hline(yintercept = 0.1, linetype = "dashed", colour = "firebrick") +
  annotate("text", x = max(freyja_residual$date), y = 0.11,
           label = "0.10 QC threshold", hjust = 1, size = 3, colour = "firebrick") +
  scale_x_date(date_breaks = "2 weeks", date_labels = "%b %d") +
  labs(
    title    = "Freyja deconvolution residual over time",
    subtitle = "Rising residual can signal emergence of lineages absent from reference set",
    x = NULL, y = "Residual (unexplained fraction)"
  ) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

# First detection (> 1 % abundance) for each variant group
first_detections <- freyja_variant %>%
  filter(abundance > 0.01) %>%
  group_by(variant_group) %>%
  slice_min(date, n = 1) %>%
  ungroup() %>%
  arrange(date)

first_detections %>%
  select(variant_group, first_detection_date = date,
         abundance_at_detection = abundance) %>%
  mutate(abundance_at_detection = percent(abundance_at_detection, accuracy = 0.1)) %>%
  kable(caption = "Table 4. First wastewater detection by variant group (> 1 % threshold)") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Table 4. First wastewater detection by variant group (> 1 % threshold)
variant_group	first_detection_date	abundance_at_detection
Delta	2021-10-04	98.3%
Other	2021-10-04	1.7%
Omicron BA.1	2021-11-28	2.6%
Omicron BA.2	2022-02-22	7.5%

# Identify individual low-abundance lineages with rising trends
# Focus on lineages that first appear at < 5 % and trend upward

emerging <- freyja_long %>%
  # Keep lineages with at least 5 time points
  group_by(lineage) %>%
  filter(n() >= 5, mean(abundance, na.rm = TRUE) < 0.05) %>%
  summarise(
    first_seen    = min(date),
    last_seen     = max(date),
    n_timepoints  = n(),
    mean_abund    = mean(abundance, na.rm = TRUE),
    mk_tau        = mk.test(abundance)$statistic,
    mk_p          = mk.test(abundance)$p.value,
    .groups       = "drop"
  ) %>%
  filter(mk_p < 0.05, mk_tau > 0) %>%   # significant positive trend
  arrange(mk_p) %>%
  slice_head(n = 15)

emerging %>%
  mutate(across(where(is.numeric), ~round(.x, 4))) %>%
  kable(caption = "Table 5. Low-abundance lineages with significant rising trends (p < 0.05, Mann-Kendall)") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Table 5. Low-abundance lineages with significant rising trends (p < 0.05, Mann-Kendall)
lineage	first_seen	last_seen	n_timepoints	mean_abund	mk_tau	mk_p
AY.119	2021-10-04	2021-12-20	11	0.0302	2.5064	0.0122

Step 6 — MCQ

Examining first-detection dates, Freyja residual dynamics, and low-abundance lineage trends, which statement best describes the wastewater surveillance system’s capacity for early variant warning at Point Loma during this window?

A) Wastewater sequencing provided no lead time over clinical surveillance: Omicron BA.1 was detected in wastewater only after it had already reached community dominance in clinical case data.

B) Omicron BA.1 appeared in wastewater at detectable but minority abundance (< 10–20 %) while Delta still dominated — and this signal preceded widespread clinical confirmation — demonstrating that wastewater sequencing can detect variant emergence cryptically, before community-level spread is apparent in case reports.

C) The Freyja residual was the primary early warning signal: it exceeded 0.5 for multiple weeks before any Omicron lineage exceeded 1 % abundance, indicating that the reference set failure (not positive abundance calls) was the actionable sentinel.

D) Low-abundance lineage tracking was uninformative because all individual Pango lineages either rose directly to > 50 % or were absent; no lineage was detectable in the 1–20 % range for more than three days.

Step 7 — Clinical Signal Integration & Public Health Interpretation

Rationale

Wastewater-based epidemiology is most powerful when integrated with clinical surveillance data because:

Lead time: WBE captures community-wide shedding (including asymptomatic and pre-symptomatic individuals) before those individuals seek testing
Signal proportionality: wastewater concentration should correlate with case counts but without testing-rate bias
Variant early warning: as shown in Step 6, variant composition in wastewater can shift before clinical genotyping data accumulate

We integrate the weekly reported San Diego County case data (cases.csv) with the 7-day rolling wastewater signal, aligning on the weekly-ending date, and compute:

Pearson correlation (log₁₀ conc vs log₁₀ cases) at lags 0, +7, +14 days (positive lag = wastewater leads cases)
Cross-correlation function (CCF) to identify the optimal lead time

# ── Aggregate wastewater to weekly averages (align with case cadence) ─────────
ww_weekly <- qpcr_wval %>%
  mutate(week_ending = ceiling_date(date, "week") - days(1)) %>%
  group_by(week_ending) %>%
  summarise(
    ww_log10_mean  = mean(log10_conc,  na.rm = TRUE),
    ww_log10_med   = median(log10_conc, na.rm = TRUE),
    ww_wval_mean   = mean(wval_score,   na.rm = TRUE),
    n_samples      = n(),
    .groups        = "drop"
  )

# ── Join with case data ───────────────────────────────────────────────────────
combined <- cases %>%
  inner_join(ww_weekly, by = "week_ending") %>%
  filter(!is.na(weekly_cases), weekly_cases > 0) %>%
  mutate(log10_cases = log10(weekly_cases))

cat("Weeks with matched WW + clinical data:", nrow(combined), "\n")

## Weeks with matched WW + clinical data: 22

# Dual-axis: wastewater (left) + cases (right)
cases_scale <- max(combined$log10_cases, na.rm = TRUE) /
               max(combined$ww_log10_mean, na.rm = TRUE)

ggplot(combined, aes(week_ending)) +
  geom_col(aes(y = log10_cases / cases_scale), fill = "#FDAE61",
           alpha = 0.6, width = 5) +
  geom_line(aes(y = ww_log10_mean), colour = "steelblue",
            linewidth = 1.2) +
  geom_point(aes(y = ww_log10_mean), colour = "steelblue", size = 2) +
  scale_y_continuous(
    name = "Wastewater log₁₀ concentration (copies/L)",
    sec.axis = sec_axis(~ . * cases_scale, name = "log₁₀ weekly cases")
  ) +
  scale_x_date(date_breaks = "2 weeks", date_labels = "%b %d") +
  labs(
    title    = "Wastewater signal vs clinical COVID-19 cases — San Diego County",
    subtitle = "Blue line = WW log₁₀ concentration; Orange bars = log₁₀ weekly reported cases",
    x = NULL
  ) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        axis.title.y.right = element_text(colour = "#FDAE61"),
        axis.title.y.left  = element_text(colour = "steelblue"))

# ── Cross-correlation function (WW leads cases at positive lags) ───────────────
# Interpolate to fill any date gaps
full_dates <- tibble(week_ending = seq(min(combined$week_ending),
                                       max(combined$week_ending),
                                       by = "week"))
combined_full <- full_dates %>%
  left_join(combined, by = "week_ending") %>%
  mutate(across(c(ww_log10_mean, log10_cases), ~na.approx(.x, na.rm = FALSE)))

ccf_result <- ccf(
  combined_full$ww_log10_mean,
  combined_full$log10_cases,
  lag.max = 6,
  plot    = FALSE,
  na.action = na.pass
)

ccf_df <- tibble(
  lag_weeks = ccf_result$lag[,,1],
  lag_days  = lag_weeks * 7,
  acf       = ccf_result$acf[,,1]
)

ggplot(ccf_df, aes(lag_days, acf)) +
  geom_col(fill = "steelblue", alpha = 0.7, width = 5) +
  geom_hline(yintercept = c(-1, 1) * qnorm(0.975) / sqrt(ccf_result$n.used),
             linetype = "dashed", colour = "firebrick") +
  geom_vline(xintercept = 0, colour = "grey40") +
  labs(
    title    = "Cross-correlation: wastewater log₁₀ concentration vs log₁₀ weekly cases",
    subtitle = "Positive lag (right of zero) = wastewater leads cases",
    x = "Lag (days; positive = WW leads cases)", y = "Cross-correlation"
  )

# Pearson correlations at specified lags (WW leads by 0, 7, 14 days)
cor_lags <- tibble(
  lag_days  = c(0, 7, 14),
  lag_weeks = lag_days / 7
) %>%
  mutate(pearson_r = map_dbl(lag_weeks, function(lag_w) {
    ww_shifted <- combined_full %>%
      mutate(ww_shifted = lag(ww_log10_mean, n = as.integer(lag_w)))
    cor(ww_shifted$ww_shifted, ww_shifted$log10_cases,
        use = "complete.obs", method = "pearson")
  }))

cor_lags %>%
  mutate(pearson_r = round(pearson_r, 3)) %>%
  kable(caption = "Table 6. Pearson correlation at WW lead lags of 0, 7, 14 days") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Table 6. Pearson correlation at WW lead lags of 0, 7, 14 days
lag_days	lag_weeks	pearson_r
0	0	0.813
7	1	0.657
14	2	0.472

# Overlay variant turnover with case trajectory
variant_weekly <- freyja_variant %>%
  mutate(week_ending = ceiling_date(date, "week") - days(1)) %>%
  group_by(week_ending, variant_group) %>%
  summarise(abundance = mean(abundance, na.rm = TRUE), .groups = "drop") %>%
  group_by(week_ending) %>%
  mutate(abundance = abundance / sum(abundance)) %>%
  ungroup()

p_vt <- ggplot(variant_weekly, aes(week_ending, abundance, fill = variant_group)) +
  geom_area(position = "stack", alpha = 0.8, colour = "white", linewidth = 0.2) +
  scale_fill_manual(values = variant_colours, name = NULL) +
  scale_y_continuous(labels = percent_format()) +
  scale_x_date(date_breaks = "2 weeks", date_labels = "%b %d") +
  labs(title = "Weekly variant composition", x = NULL, y = "Proportion") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1), legend.position = "top")

p_cases <- ggplot(combined, aes(week_ending, weekly_cases)) +
  geom_col(fill = "#FDAE61", alpha = 0.8) +
  scale_x_date(date_breaks = "2 weeks", date_labels = "%b %d") +
  scale_y_continuous(labels = comma) +
  labs(title = "Weekly reported COVID-19 cases (San Diego County)",
       x = NULL, y = "Weekly cases") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

p_vt / p_cases

Step 7 — MCQ

Integrating the wastewater viral load trajectory, Freyja variant composition dynamics, and weekly clinical case counts, which statement best captures the public health value of wastewater surveillance at Point Loma during this window?

A) Wastewater and clinical case signals rose and fell in perfect synchrony (zero-lag peak correlation), and variant succession in wastewater mirrored clinical genotyping data simultaneously — indicating WBE added no meaningful lead time advantage over case-based surveillance.

B) The wastewater viral load signal led the clinical case surge by approximately one to two weeks, and Omicron BA.1 was detectable in wastewater at minority abundance while Delta still dominated clinical genotyping reports — demonstrating that integrated WBE provides earlier warning of both epidemic intensity and variant emergence than case-based surveillance alone.

C) The clinical case signal consistently led the wastewater signal by approximately three weeks, suggesting that symptomatic individuals sought testing and drove clinical counts before their wastewater shedding became measurable at the sewershed level.

D) The cross-correlation was not significant at any lag, indicating the wastewater signal and case counts tracked entirely independent phenomena (e.g., non-infectious shedding, industrial contamination) with no epidemiological connection.

Session Information

sessionInfo()

## R version 4.6.0 (2026-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-3 viridis_0.6.5      viridisLite_0.4.3  ggrepel_0.9.8     
##  [5] patchwork_1.3.2    scales_1.4.0       kableExtra_1.4.0   knitr_1.51        
##  [9] broom_1.0.12       trend_1.1.6        changepoint_2.3    TTR_0.24.4        
## [13] slider_0.3.3       zoo_1.8-15         lubridate_1.9.5    forcats_1.0.1     
## [17] stringr_1.6.0      dplyr_1.2.1        purrr_1.2.2        readr_2.2.0       
## [21] tidyr_1.3.2        tibble_3.3.1       ggplot2_4.0.3      tidyverse_2.0.0   
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6        xfun_0.57           bslib_0.10.0       
##  [4] lattice_0.22-9      tzdb_0.5.0          vctrs_0.7.3        
##  [7] tools_4.6.0         generics_0.1.4      parallel_4.6.0     
## [10] curl_7.1.0          xts_0.14.2          pkgconfig_2.0.3    
## [13] Matrix_1.7-5        S7_0.2.2            lifecycle_1.0.5    
## [16] compiler_4.6.0      farver_2.1.2        textshaping_1.0.5  
## [19] htmltools_0.5.9     sass_0.4.10         yaml_2.3.12        
## [22] crayon_1.5.3        pillar_1.11.1       jquerylib_0.1.4    
## [25] cachem_1.1.0        nlme_3.1-169        tidyselect_1.2.1   
## [28] digest_0.6.39       stringi_1.8.7       splines_4.6.0      
## [31] labeling_0.4.3      fastmap_1.2.0       grid_4.6.0         
## [34] cli_3.6.6           magrittr_2.0.5      withr_3.0.2        
## [37] backports_1.5.1     warp_0.2.3          bit64_4.8.0        
## [40] timechange_0.4.0    rmarkdown_2.31      extraDistr_1.10.0.3
## [43] bit_4.6.0           gridExtra_2.3       hms_1.1.4          
## [46] evaluate_1.0.5      mgcv_1.9-4          rlang_1.2.0        
## [49] Rcpp_1.1.1-1.1      glue_1.8.1          xml2_1.5.2         
## [52] svglite_2.2.2       rstudioapi_0.18.0   vroom_1.7.1        
## [55] jsonlite_2.0.0      R6_2.6.1            systemfonts_1.3.2

SARS-CoV-2 Wastewater Surveillance Analysis

Point Loma WWTP, San Diego CA — Oct 2021 to Feb 2022

Biosecurity Surveillance Engineer Candidate

2026-04-30

Preamble

Package dependencies

Data loading

Step 1 — Sample QC

Rationale

Step 1 — MCQ

Step 2 — Concentration Transformation

Rationale

Step 2 — MCQ

Step 3 — Baseline & Activity Level Classification (CDC WVAL)

Rationale

Step 3 — MCQ

Step 4 — Trend Detection

Rationale

Step 4 — MCQ

Step 5 — Variant Deconvolution & Lineage Dynamics

Rationale

Step 5 — MCQ

Step 6 — Emerging / Novel Lineage Detection

Rationale

Step 6 — MCQ

Step 7 — Clinical Signal Integration & Public Health Interpretation

Rationale

Step 7 — MCQ

Session Information