Evaluating the 10% WQS exceedance threshold

Numeric water quality standards (WQS) are frequently implemented using a 10% exceedance threshold. That is, if 10% of samples from a waterbody exceed the numeric WQS, that waterbody is listed as impaired. An impairment listing can trigger substantial regulatory requirements. Understanding how the 10% exceedance rule behaves in practice is therefore critical to interpreting sample results. Statistical simulation can provide guidance.

Let’s create a ‘waterbody’ with a known water quality (WQ) parameter population. Environmental concentration data often follow a log-normal type distribution. We will use this distribution in our example to simulate the concentration of a WQ parameter. Let the natural log of a WQ parameter distribution equal ~ N(4,0.75), a normal distribution with a mean of 4 and standard deviation of 0.75. Let’s further assume that the natural log of the associated numeric WQS criteria is 5; a sample observation > 5 violates the numeric standard. If 10% of samples are > 5, the waterbody is impaired.

The following code visualizes this WQ parameter distribution and identifies the portion of the distribution above the WQS.

## load libraries
require(ggplot2); require(stringr); library(showtext)

## import plot font
font.add.google("Roboto", "my_rbt")

## set plot theme
theme_hc <- function(){
  theme(
    text                = element_text(family = "my_rbt", size = 16, color="gray40"),
    title               = element_text(size=11), 
    axis.title.x        = element_text(hjust = .5),
    axis.title.y        = element_text(hjust = .5),
    axis.ticks.y        = element_blank(),
    axis.ticks.x        = element_line(color = 'gray', size = 0.7),
    panel.grid.major.y  = element_line(color = 'gray', size = 0.3),
    panel.grid.minor.y  = element_blank(),
    panel.grid.major.x  = element_blank(),
    panel.grid.minor.x  = element_blank(),
    panel.border        = element_blank(),
    panel.background    = element_blank(),
    legend.position     = "bottom",
    legend.title        = element_blank(),
    legend.text         = element_text(color="gray40", size=12),
    legend.key          = element_rect(size = 2, fill=NA),
    legend.key.size     = unit(1, 'lines'),
    legend.key.width    = unit(3,'lines')
  )
}

## create WQ parameter distribution
x<-rnorm(10000, mean=4, sd=0.75)
df<-data.frame(x=x)

## calculate density and geom_ribbon points for plot
dens <- density(x)
wqs <- 5
q99 <- quantile(x, 0.99999)

## create density dataframe
dd <- with(dens,data.frame(x,y))

## plot
my_title <- expression(paste("Example Log-Normal WQ Parameter Distribution ", italic("~ N(4, 0.75)")))
showtext.auto(TRUE)
ggplot(df) + stat_density(aes(x=x), geom="line", size=0.4, col="gray") +
  geom_ribbon(data=subset(dd,x>wqs & x<q99),aes(x=x,ymax=y),ymin=0, ## shade area greater than standard
              fill="red",colour=NA,alpha=0.5) + labs(x="",y="", title=my_title, subtitle="WQS numeric criteria at 5.0") + theme_hc()

How frequently does our waterbody exceed the WQS (i.e., how often does a sample fall in the shaded portion of the distribution)?

1-pnorm(5, mean=4, sd=0.75)

## [1] 0.09121122

The waterbody meets the WQS; the exceedance probability is < 10%. However, what is the risk that this waterbody will be incorrectly listed as impaired? The null hypothesis in this case is that the waterbody is not impaired. In the context of hypothesis testing, incorrectly rejecting the null hypothesis is a Type I error. As we’ll see, the risk of committing a Type I error depends on the sampling approach.

Rather than test a single distribution and sampling regime, we can use R to simulate multiple exceedance thresholds and sampling approaches. First, we create a function to accomplish the following tasks:

Draw a set number of random samples (1, 10, 20, etc.).
Repeat step 1 n-number of times.
Iterate steps 1 and 2 for many different log-normal distributions, each of which has a known WQS exceedance probability.
For each unique combination of sample count and distribution, calculate the percent of WQS exceedances.

Since the example distributions we’ll create all have < 10% probability of exceeding the WQS, the output from Step 4 is the risk of committing a Type I error. For instance: if we repeatedly take 10 samples from ~ N(4,0.75), how frequently is the waterbody listed as impaired? Since we know this distribution does not exceed the WQS more than 10% of the time, this frequency is the risk of Type I error.

(Technical aside: functionals and vectorization are great, but the computational costs of a for loop in R are not always substantial, and the code is more readable. However, at the end of the post I’ve also implemented the simulation function below using mapply(), which is more concise and considerably faster.)

## set simulation function
sim<-function(x){ 

## to get a stable estimate, we need a large number of trials 
iter=10000 # number of iterations for each sampling regime

sp<-c(1,seq(10,120, by=10)) ## specify sampling regimes to test; single sample, 10 samples, etc.
violation <- numeric(length=iter) ## create empty vector for results
foo <- as.data.frame(matrix(nrow=length(sp), ncol=2)); colnames(foo) <- c("n", "T1") ## create dataframe for output
  
for (i in 1:length(sp)) {
  for (j in 1:iter){ 
  violation[j] <-sum(rnorm(sp[i],x$mu,0.75)>3 ) > sp[i]/10 ## percent of sample sets that indicate impairment
  foo[i,1] <- sp[i]
  foo[i,2] <- mean(violation)
  }
}
return(foo)
}

The sampling regimes to be tested are given by the vector sp:

sp<-c(1,seq(10,120, by=10))
sp

##  [1]   1  10  20  30  40  50  60  70  80  90 100 110 120

1 sample, 10 samples, 20 samples…

Now let’s create a series of distributions and calculate the probability of seeing a numeric WQS exceedance (value > 5). We will use lapply() to apply our function to each distribution.

## create data; specify mean and sd
df<-data.frame(case=letters[1:6], mu=c(1.7, 1.8, 1.85, 1.9, 1.95, 2.0))
df$p90<-round(1-pnorm(3, df$mu, 0.75),3)
df

##   case   mu   p90
## 1    a 1.70 0.042
## 2    b 1.80 0.055
## 3    c 1.85 0.063
## 4    d 1.90 0.071
## 5    e 1.95 0.081
## 6    f 2.00 0.091

The exceedance probabilities range from 9.1% to 4.2%. All are < 10%.

## split, apply, combine
df2<-split(df, df$case) ## split data frame into list components by case

df3<-lapply(df2, sim) ## apply sim() to each list component

df4 <-data.frame(do.call(rbind, df3)) ## combine

df4$case<-str_extract(rownames(df4),"[[:letter:]]") ## reassign case names
rownames(df4)<-NULL
head(df4)

##    n     T1 case
## 1  1 0.0405    a
## 2 10 0.0609    a
## 3 20 0.0471    a
## 4 30 0.0323    a
## 5 40 0.0222    a
## 6 50 0.0193    a

## plot results
df4$case_name=as.factor(df4$case)

## rename factor levels for plot
levels(df4$case_name)[levels(df4$case_name)=="a"] <- "4.2%"
levels(df4$case_name)[levels(df4$case_name)=="b"] <- "5.5%"
levels(df4$case_name)[levels(df4$case_name)=="c"] <- "6.3%"
levels(df4$case_name)[levels(df4$case_name)=="d"] <- "7.1%"
levels(df4$case_name)[levels(df4$case_name)=="e"] <- "8.1%"
levels(df4$case_name)[levels(df4$case_name)=="f"] <- "9.1%"

## set plot theme
theme_hc2 <- function(){
  theme(
    text                = element_text(family = "my_rbt", size = 10, color="gray40"),
    title               = element_text(hjust = 0), 
    axis.title.x        = element_text(hjust = .5),
    axis.title.y        = element_text(hjust = .5),
    axis.ticks.y        = element_blank(),
    panel.grid.major.y  = element_line(color = 'gray', size = .3),
    panel.grid.minor.y  = element_blank(),
    panel.grid.major.x  = element_blank(),
    panel.grid.minor.x  = element_blank(),
    axis.ticks.x        = element_line(color = 'gray', size = 0.7),
    panel.border        = element_blank(),
    panel.background    = element_blank(),
    legend.position     = "right",
    legend.title        = element_blank(),
    legend.text         = element_text(color="gray40", size=8),
    legend.key          = element_rect(size = 2, fill=NA),
    legend.key.size     = unit(1, 'lines'),
    legend.key.width    = unit(2,'lines')
  )
}

## plot
showtext.auto(TRUE)
p1<-ggplot(df4) + geom_line(aes(x=n, y=T1, group=case_name, color=case_name), size=1.3) + 
  labs(title="The 10% Numeric WQS Exceedance Rule and False Impairment Determination", 
       subtitle="All population distributions < 10% exceedance",x="Number of Samples", y="Probability of False Determination") +  guides(linetype = guide_legend(override.aes = list(size = 1))) +
  scale_x_continuous(breaks=seq(0,120, by=10)) + theme_hc2() + guides(col = guide_legend(reverse = TRUE))
p1

For a compliant waterbody that exceeds the WQS 5.5% of the time, the risk of Type I error under a 10-sample sampling regime is approximately 10%.

df4[df4$case=="b" & df4$n==10,]

##     n     T1 case case_name
## 15 10 0.1021    b      5.5%

Interestingly, the impact of increasing sample size on the risk of Type I error varies by waterbody exceedance; the closer the waterbody is to exceeding the WQS, the less increasing the sampling frequency reduces the risk of a false impairment determination. The finding here is that sampling regimes should be carefully designed to minimize the risk of Type I error, especially for waterbodies suspected of being close to the 10% exceedance standard.

This simulation function could be easily modified to explore Type II error risks for waterbodies that exceed the 10% exceedance standard by relatively small margins; i.e., what is the risk that a waterbody that exceeds the WQS 15% of the time will be declared as non-impaired?

mapply() implementation of simulation function

Faster, more concise.

## reformulate simulation function for mapply()
sim2<-function(n,mu,std){
violation <- numeric(length=10000) ##create a numeric vector/simulation number
for (i in 1:10000) {
  violation[i] <- sum(rnorm(n, mu, std)>3 ) > n/10
  
}
mean(violation)}

## use mapply to apply function to vectors of sample regime, mu, and sd, respectively
ex<-mapply(sim2, rep(c(1,10,20,30,40,50,60,70,80,90,100,110,120),each=6), c(1.7,1.8,1.85,1.9,1.95,2), 0.75) 

## format output as dataframe; map output to case names as above
dd<-data.frame(out=ex, case=c('a','b','c','d','e','f'), samp=rep(c(1,10,20,30,40,50,60,70,80,90,100,110,120),each=6))
head(dd)

##      out case samp
## 1 0.0435    a    1
## 2 0.0583    b    1
## 3 0.0645    c    1
## 4 0.0693    d    1
## 5 0.0812    e    1
## 6 0.0933    f    1