Introduction

The City of Ames has been monitoring the South Skunk River on a weekly basis above and below its wastewater treatment system, the Water Pollution Control Facility, since 2003. There is a lot that could be done with this rich dataset, but we’ll look nitrate and total phosphorus from Jan 2011 to Sept 2020 to illustrate the value of long-term monitoring, and to demonstrate the power paired same-day observations for finding patterns amid the noise of weather variation. Pairing is an strategy that can boost the statistical power of smaller datasets as well.

The data

Thanks to Andy Curtis for providing the data, and for his insights.

For each date, the following table shows total phosphorus (P) and nitrate-N, in mg/L, for the site immediately upstream of the WPC, 0.3 miles downstream, and in the treated effluent from the plant.

# A pre-filtered dataset provided by Andy Curtis, Ames Water & Pollution Control. The dataset extends back to 2003,
# but I've limited it to ten years to simplify the concept, and avoid complications with a change to labotory 
# method for nitrogen.
library(tidyverse)
library(lubridate)
ames_data <- read_csv("data/ames_WPC_nutrients_2011-2020.csv", 
    col_types = cols(CollectionDate = col_date(format = "%m/%d/%Y"),
      Analyte = col_factor(levels = c("Nitrate Nitrogen as N", 
         "Total Phosphorus as P")), CollectionTime = col_skip(), 
        Description = col_factor(levels = c("River - Upstream", "River - 0.3 Downstream", "Final")), 
        LabID = col_skip()))
ames_data <- rename(ames_data, Site = Description) %>%
  # Parse date into year and month (3 letter abr) for later subgroup analysis
  mutate(Year = year(CollectionDate), Month = month(CollectionDate, label = TRUE, abbr = TRUE), Week = week(CollectionDate))
  
# Pivot so that each date is a row, and calculates difference between downstream and upstream sites
ames_paired <- ames_data  %>%
  select(Site, CollectionDate, Year, Month, Week, Analyte, Result) %>%
  # Recodes analyte and site with an abbreviation
  mutate(Analyte = fct_recode(Analyte,
                           "N" = "Nitrate Nitrogen as N",
                           "P" = "Total Phosphorus as P"),
         Site = fct_recode(Site,
                           "Upstream" = "River - Upstream",
                           "Downstream" = "River - 0.3 Downstream",
                           "Effluent" = "Final"
                           )) %>%
pivot_wider(names_from = c(Analyte, Site), values_from = Result) %>%
    mutate(P_Diff = P_Downstream - P_Upstream, N_Diff = N_Downstream - N_Upstream)

To facilitate interpretation, we’ll also bring in streamflow, measured at a USGS gage a few miles upstream of the plant, but below the confluence with Squaw Creek. Water quality samples are collected around 8:00AM, so we’ll limit the records to that time frame. To identify if there was a significant rain event in the past 24 hours that might have affected water quality, we’ll calculate the difference in flow from the previous day. Streamflow can be ranked and organized as exceedence probability.

# Imports USGS flow data, parses date, filters to 8:00 (matching time of water quality samples)
# and calculates difference between dates (to identify rain events)
USGS <- read_csv("data/skunk_discharge_ames.csv", 
     col_types = cols(datetime = col_datetime(format = "%m/%d/%Y %H:%M")))
USGS <- mutate(USGS, CollectionDate = date(datetime), Hour = hour(datetime), Min = minute(datetime)) %>%
filter(Hour == 8, Min == 0) %>%
mutate(ChangeFlow = discharge_cfs - lag(discharge_cfs))
ames_paired_flow <- select(USGS, CollectionDate, discharge_cfs, ChangeFlow) %>% right_join(ames_paired)
ames_paired

The impact of the WPC

Phosphorus in the South Skunk River above the WPC is normally below 1 mg/L. Total phosphorus in the effluent is normally between 3 and 6 mg/L. As a result, phosphorus concentrations increase below the plant. This effect is most pronounced during low flow conditions, when the effluent makes up a larger proportion of the water in the river. The difference becomes small as flow increases, for the follwoing reasons. Phosphorus upstream tends to increase with streamflow, probably due to the influence of runoff and erosion. Phosphorus in the effluent tends to decrease with streamflow, probably due infiltration of water into the sewer system that dilutes the waste stream in wet weather.

ggplot(data = ames_paired_flow, aes(x = discharge_cfs)) +
  geom_point(aes(y = P_Upstream), color = "orange") +
  labs(title = "Phosphorus upstream of the WPC versus streamflow", x = "Streamflow (cfs)", y = "Phosphorus (mg/L)") +
  theme_bw()

ggplot(data = ames_paired_flow, aes(x = discharge_cfs)) +
  geom_point(aes(y = P_Diff), color = "purple") +
  geom_point(aes(y = P_Effluent), color = "red") +
  labs(title = "Phosphorus in effluent versus streamflow", subtitle = "Purple = Increase in phosphorus concentrations downstream of WPC", x = "Streamflow (cfs)", y = "Phosphorus (mg/L)") +
  theme_bw()

Differences between sites

Phosphorus is higher downstream of the WPC plant, as we’d expect. Based on the full dataset, the mean is 0.29 mg/L higher.

ames_data %>%
  filter(Analyte == "Total Phosphorus as P", Site != "Final") %>%
  ggplot(aes(x = Site, group = Site, y = Result)) +
  geom_boxplot() +
  labs(y = "Total phosphorus (mg/L)")

summary(ames_paired_flow$P_Diff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -2.2800  0.0000  0.0600  0.2911  0.2000  4.1000      54

That’s clear enough from a large dataset, but would we be able to see this with a smaller sample?

If instead of the entire record we look at single years, the difference between sites varies, ranging from 0.94 mg/L difference in 2013 to less than 0.05 mg/L difference in 2015, 2016, and 2018. Our confidence that downstream site is greater ranges from 99.9% in 2013 (p < 0.001), to only 76% in 2019 (p = 0.24), to only 3% in 2013 (p = 0.97).

ames_P_graph <- ames_data %>%
  filter(Analyte == "Total Phosphorus as P", Site != "Final")
ames_P_graph %>%
  group_by(Site, Year) %>%
  summarise(P_mean = mean(Result)) %>%
ggplot() +
  geom_col(aes(x = factor(Year), y = P_mean, fill = Site), position = "dodge2") +
    labs(x = "Year", y = "Total Phosphorus (mg/L)", title = "Yearly average, based on weekly samples")

# print("T-test for difference in phosphorus in 2013")
# ttest1 <- filter(ames_paired_flow, (Week + 2) %% 4 == 0, Year == 2013)
# t.test(ttest1$P_Downstream, ttest1$P_Upstream, conf.level = 0.90)
# print("T-test for difference in phosphorus in 2015")
# ttest1 <- filter(ames_paired_flow, (Week + 2) %% 4 == 0, Year == 2015)
# t.test(ttest1$P_Downstream, ttest1$P_Upstream, conf.level = 0.90)
# print("T-test for difference in phosphorus in 2019")
# ttest1 <- filter(ames_paired_flow, (Week + 2) %% 4 == 0, Year == 2019)
# t.test(ttest1$P_Downstream, ttest1$P_Upstream, conf.level = 0.90)

Water quality studies often face the frustrating situation that through luck of the draw, a year or two of data is less than conclusive. To get around this, we can take advantage of the power of pairing. Phosphorus can vary quite a bit day to day and month to month, but when it’s high upstream it’s usually high downstream and when it’s low upstream it’s usually low downstream.

ames_paired_flow %>%
  filter(Year == 2019) %>%
ggplot(aes(x = P_Upstream, y = P_Downstream)) +
  geom_point() +
  labs(title = "Total phosphorus in 2019, upstream vs downstream of WPC", subtitle = "When it's high upstream it's generally high downstream", x = "Phosphorus upstream of WPC (mg/L)", y = "Phosphorus downstream of the WPC (mg/L)")

Making apples-to-apples comparisons between data collected on the same day and similar conditions can help us detect smaller changes between sites. When weekly samples from 2019 are paired by day, it boosts the strength of the evidence from p = 0.24 to p < 0.001; that is to say, we can now be over 99.9% certain there is a difference. Expressed as a 90 percent confidence intervals, the downstream site is between 0.05 mg/L and 0.15 mg/L greater.

With only monthly samples, the 90 percent confidence intervals are wider (0.04 to 0.26 mg/L), but pairing still allows to detect a statistically significant increase, where otherwise results would have been inconclusive.

Correcting for flow conditions

Not yet completed…

The power of pairing for trend detection

If we’re interested in differences across time, same-day pairing can’t help us. However, if the bulk of the variation is seasonal, we can make apples-to-apples comparisons between samples collected during the same month of the year.

Nitrate has strong seasonal pattern, consistently high in May and June, consistently low in August, moderate in winter, and highly variable in fall.

ggplot(ames_paired_flow, aes(x = Month, y = N_Upstream, group = Month)) +
  geom_boxplot() +
  labs (y = "Nitrate-N (mg/L)", title = "Seasonal pattern of nitrate in South Skunk River")

Without any kind of seasonal adjustment, a linear regression falls flat.

ggplot(ames_paired_flow, aes(x = CollectionDate, y = N_Upstream)) +
  geom_point() +
  geom_line(alpha= 0.2) +
  labs(title = "Upstream of WPC", y = "Nitrate-N (mg/L)", x = "Date") +
  geom_smooth(method = lm)

# summary(lm(N_Upstream ~ CollectionDate, data = ames_paired_flow))

Separating out the main period of active tile flow (March-July) improves the fit, and makes the model statistically signficant. (Don’t ask for the r-squared, it’s still terrible).

However, there are limits to approach. To simulate data from “spring snapshot” monitoring events, we’ve pulled out the measurements from the third week in May of each year. You can’t really fit a trend line to this little data if there’s any year-to-year or day-to-day noise obscuring the trend. It is my view that infrequent volunteer monitoring is most best used for comparing sites and for engaging the public, not for understanding trends.

filter(ames_paired_flow, Week %in% c(13:28)) %>%
ggplot(aes(x = CollectionDate, y = N_Upstream)) +
  geom_point() +
  labs(title = "Upstream of WPC", y = "Nitrate-N (mg/L)", x = "Date") +
  geom_smooth(method = lm)

# lm_a <- filter(ames_paired_flow, Week %in% c(13:28))
#  summary(lm(N_Upstream ~ CollectionDate, data = lm_a))
filter(ames_paired_flow, Week == 21) %>%
ggplot(aes(x = CollectionDate, y = N_Upstream)) +
  geom_point() +
  labs(title = "Mid-May samples only", subtitle = "Simulating a Spring Snapshot", y = "Nitrate-N (mg/L)", x = "Date") +
  geom_smooth(method = lm)

#lm_b <- filter(ames_paired_flow, Week == 21)
#  summary(lm(N_Upstream ~ CollectionDate, data = lm_b))

This basic concept–making apples to apples comparison across seasons–is used in more sophisticated approaches to trend analysis such as the seasonal Kendall test, multiple linear regresison models, and Weighted Regression on Time, Discharge, and Season (WRTDS).