1 Introduction

This project aims to establish methods for statistical analysis with non-detects. Pesticide monitoring data often contain substantial amount of non-detects, which are concentrations that are below level of quantification. Traditional methods used substitutions or treat non-detects as zeros. These methods are problematic and could result in many problems such as failing to detect trends when there is or detecting trends that are artificially introduced by substitution. Helsel (2012) developed statistical methods that can perform many kinds of statistical analysis without substituting non-detects. These methods are included in a R package called NADA. Later in 2021, Helsel developed version 2 of the NADA package (NADA2). This document demonstrates a few examples of using the NADA and NADA2 R packages for analyzing pesticide monitoring data in SURF.

2 Importing bifenthrin monitoring data

bif = read.csv("D:\\Statistical_Learning\\NADA_SURF\\RawData\\bifen.csv", colClasses =c(rep("character",13), rep("numeric",3), "character","numeric","character"))

names(bif)
##  [1] "Sample.Type"         "Sample.Matrix"       "Sample.Date"        
##  [4] "Catchment.Category"  "Watershed.Category"  "Event.Type"         
##  [7] "Site.Type"           "Waterbody.Type"      "Region"             
## [10] "County"              "Watershed"           "Site.ID"            
## [13] "Analyte.Name"        "Result"              "MDL"                
## [16] "RL"                  "Qualifier"           "Minimum.BM"         
## [19] "Minimum.BM.Endpoint"
head(bif)
##   Sample.Type Sample.Matrix Sample.Date Catchment.Category Watershed.Category
## 1        Norm         Water   3/22/2016                 AG                 AG
## 2        Norm         Water   3/22/2016                 AG                 AG
## 3        Norm         Water   4/11/2016                 AG                 AG
## 4        Norm         Water   4/11/2016                MIX                MIX
## 5        Norm         Water   4/11/2016                 AG                UND
## 6        Norm         Water   6/13/2016                MIX                MIX
##   Event.Type Site.Type        Waterbody.Type Region   County         Watershed
## 1   Nonstorm  Ag Ditch Engineered Conveyance    Imp Imperial       Alamo River
## 2   Nonstorm  Ag Ditch Engineered Conveyance    Imp Imperial         New River
## 3   Nonstorm  Ag Ditch Engineered Conveyance    Sal Monterey     Salinas River
## 4   Nonstorm  Ag Ditch Engineered Conveyance    Sal Monterey Tembladero Slough
## 5   Nonstorm  Ag Ditch Engineered Conveyance    Sal Monterey Tembladero Slough
## 6   Nonstorm  Ag Ditch Engineered Conveyance    Sal Monterey Tembladero Slough
##         Site.ID Analyte.Name    Result MDL    RL Qualifier Minimum.BM
## 1 Imp_Holtville   Bifenthrin -99.00000  NA 0.001        ND     0.0013
## 2     Imp_Rice3   Bifenthrin -99.00000  NA 0.001        ND     0.0013
## 3   Sal_Chualar   Bifenthrin -99.00000  NA 0.001        ND     0.0013
## 4    Sal_SanJon   Bifenthrin   0.00284  NA 0.001         D     0.0013
## 5  Sal_Hartnell   Bifenthrin   0.00791  NA 0.001         D     0.0013
## 6    Sal_SanJon   Bifenthrin   0.00125  NA 0.001         D     0.0013
##   Minimum.BM.Endpoint
## 1                  IC
## 2                  IC
## 3                  IC
## 4                  IC
## 5                  IC
## 6                  IC
#summary(bif$Sample.Date)
#summary(bif$Result)
#summary(bif$RL)

3 Explore the data

summary(bif)
##  Sample.Type        Sample.Matrix      Sample.Date        Catchment.Category
##  Length:791         Length:791         Length:791         Length:791        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Watershed.Category  Event.Type         Site.Type         Waterbody.Type    
##  Length:791         Length:791         Length:791         Length:791        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     Region             County           Watershed           Site.ID         
##  Length:791         Length:791         Length:791         Length:791        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Analyte.Name           Result              MDL                RL       
##  Length:791         Min.   :-99.0000   Min.   :0.00091   Min.   :0.001  
##  Class :character   1st Qu.:-99.0000   1st Qu.:0.00091   1st Qu.:0.001  
##  Mode  :character   Median :  0.0017   Median :0.00091   Median :0.001  
##                     Mean   :-40.6596   Mean   :0.00094   Mean   :0.001  
##                     3rd Qu.:  0.0106   3rd Qu.:0.00099   3rd Qu.:0.001  
##                     Max.   :  0.6490   Max.   :0.00099   Max.   :0.001  
##                                        NA's   :164                      
##   Qualifier           Minimum.BM     Minimum.BM.Endpoint
##  Length:791         Min.   :0.0013   Length:791         
##  Class :character   1st Qu.:0.0013   Class :character   
##  Mode  :character   Median :0.0013   Mode  :character   
##                     Mean   :0.0013                      
##                     3rd Qu.:0.0013                      
##                     Max.   :0.0013                      
## 
summary(bif$Result)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -99.0000 -99.0000   0.0017 -40.6596   0.0106   0.6490
summary(bif$RL)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.001   0.001   0.001   0.001   0.001   0.001

4 Processing data to created data fields needed for analysis

bif=bif %>% 
  mutate(conc= case_when (Qualifier== 'D'  ~ Result,
                          Qualifier== 'ND' ~ RL)
  )

## create a new field for indicating whether the Conc is censored or not
bif=bif %>% 
  mutate(cen.ind= case_when (Qualifier== 'D'  ~ 0,
                             Qualifier== 'ND' ~ 1)
  )

## Convert sampling data to decimal year
class(bif$Sample.Date)
## [1] "character"
bif$Sample.Date2=as.Date(bif$Sample.Date, format="%m/%d/%Y")

bif=bif %>% 
  mutate(year.dec=decimal_date(Sample.Date2))

summary(bif$year.dec)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2016    2018    2019    2019    2020    2021
summary(bif$RL)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.001   0.001   0.001   0.001   0.001   0.001
summary(bif$conc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00100 0.00100 0.00170 0.01718 0.01060 0.64900
bif2= bif[,c('conc', 'cen.ind', 'year.dec', 'Event.Type', 'Site.Type', 'Waterbody.Type', 'Region')]

bif2=bif2 %>% 
  mutate(cen.logic=case_when (cen.ind== 1  ~  TRUE,
                             cen.ind== 0  ~  FALSE)
  )

bif2$year=year(bif$Sample.Date2)

5 Summary statistics

Explore the data set and get a summary statistics for bifenthrin concentraiton

attach(bif2)
cboxplot(conc, cen.ind)

#fit=cenfit(conc, cen.logic, as.factor(year))
#fit
#plot(fit)

cfit(conc, cen.logic)

## 
##  Output for conc              95% Confidence Intervals 
##  Statistics: 
##     N PctND Conf  KMmean    KMsd KMmedian LCLmean UCLmean LCLmedian UCLmedian
##   791 41.09   95 0.01718 0.05009   0.0017 0.01382 0.02055    0.0014   0.00238
## 
## Quantiles:   Q10 Q25 Q50 Q75 Q90 
##          <0.001  <0.001  0.0017  0.0106  0.0451  
##  
enparCensored(conc,cen.ind, ci=TRUE, ci.method="bootstrap", n.bootstraps = 5000)
## $distribution
## [1] "None"
## 
## $sample.size
## [1] 791
## 
## $censoring.side
## [1] "left"
## 
## $censoring.levels
## [1] 0.001
## 
## $percent.censored
## [1] 41.08723
## 
## $parameters
##       mean         sd    se.mean 
## 0.01718497 0.05008888 0.00171501 
## 
## $n.param.est
## [1] 2
## 
## $method
## [1] "Kaplan-Meier"
## 
## $data.name
## [1] "conc"
## 
## $censoring.name
## [1] "cen.ind"
## 
## $bad.obs
## [1] 0
## 
## $interval
## $name
## [1] "Confidence"
## 
## $parameter
## [1] "mean"
## 
## $limits
##    Pct.LCL    Pct.UCL    BCa.LCL    BCa.UCL      t.LCL      t.UCL 
## 0.01441927 0.02029467 0.01441719 0.02028770 0.01456480 0.02068234 
## 
## $type
## [1] "two-sided"
## 
## $method
## [1] "Bootstrap"
## 
## $conf.level
## [1] 0.95
## 
## $sample.size
## [1] 791
## 
## $n.bootstraps
## [1] 5000
## 
## $too.few.obs.count
## [1] 0
## 
## $no.cen.obs.count
## [1] 0
## 
## attr(,"class")
## [1] "intervalEstimateCensored"
## 
## attr(,"class")
## [1] "estimateCensored"

6 Trend analysis

6.1 What’s the distribution of bifenthrin concentration data?

First, check to see if the concentration is normally distributed. From the plots below, we can see that the concentration fits a log-normal distribution (P=0.988). So for the following analysis, we can either use parametric (with log transformation) or non-parametric methods.

cenCompareQQ(conc, cen.ind)
## Best of the three distributions is the lognormal

6.2 Is there a trend over time?

The parametric MLE test shows that bifenthrin Concentration is positively correlated with time but it’s not significant with P-value = 0.507. So no significant trend observed. In addition, the residual plot from the regression shows that it’s not normally distributed indicating issues with the parametric approach. Next, let’s try the non-parametric test using the Akritas-Theil-Sen for censored data. The ATS line has a Kendall’s tau = 0.0181, and p-value = 0.42928. The result is consistent with the MLE test that the trend is not significant.

# MLE simple regression
cencorreg (conc, cen.ind, year.dec, LOG=TRUE, verbose=2)

##  Likelihood R = 0.02361                     AIC = 2646.908 
## Rescaled Likelihood R = 0.02404            BIC = 2659.932 
##  McFaddens R = 0.01292 
## 
## Call:
## survreg(formula = "log(conc)", data = "year.dec", dist = "gaussian")
## 
## Coefficients:
##  (Intercept)     year.dec 
## -89.60682413   0.04125878 
## 
## Scale= 2.365251 
## 
## Loglik(model)= -1320   Loglik(intercept only)= -1320.2
##  Chisq= 0.44 on 1 degrees of freedom, p= 0.507 
## n= 791
# Akritas-Theil-Sen

ATS (conc, cen.ind, year.dec)
## Akritas-Theil-Sen line for censored data 
##  
## ln(conc) = -98.3527 + 0.0447 * year.dec 
## Kendall's tau = 0.0181   p-value = 0.42928 
## 

6.3 Do concentration differ by event type (storm vs. non-storm), site type, water body type and region?

Next, let’s see if concentration trend is associated with other factors such as event type, site type, region and waterbody type. We use the Peto Peto test to test the difference between groups (e.g. event type). The boxplot shows that there is difference in bifenthrin concentration between storm and non-storm events, with storm events producing much higher concentration. There’s also significant differences among different site types, waterbody types and region. The Peto Peto tests also confirms that the difference is statistically significant (P-value < 0.0001).

cboxplot(conc, cen.ind, Event.Type, show = TRUE, LOG=TRUE)

cboxplot(conc, cen.ind, Site.Type, show = TRUE, LOG=TRUE)

cboxplot(conc, cen.ind, Region, show = TRUE, LOG=TRUE)

cboxplot(conc, cen.ind, Waterbody.Type, show = TRUE, LOG=TRUE)

cen1way (conc, cen.ind, Event.Type)
##          grp     N    PctND     KMmean      KMsd   KMmedian
## 1   Nonstorm   549   57.380   0.005469   0.03059     <0.001
## 2      Storm   242    4.132   0.043760   0.07113     0.0181
## 
##       Oneway Peto-Peto test of CensData: conc   by Factor: Event.Type 
##       Chisq = 493   on 1 degrees of freedom     p = 3.17e-109
cen1way (conc, cen.ind, Site.Type)
##             grp     N    PctND     KMmean      KMsd   KMmedian
## 1      Ag Ditch    97   55.670   0.006526   0.02074     <0.001
## 2   Storm Drain   196    4.082   0.034500   0.07390     0.0085
## 3      Waterway   498   52.810   0.012450   0.03986     <0.001
## 
##       Oneway Peto-Peto test of CensData: conc   by Factor: Site.Type 
##       Chisq = 220.3   on 2 degrees of freedom     p = 1.42e-48 
## 
##  Pairwise comparisons using Peto & Peto test 
## 
## data:  CensData and Factor 
## 
##             Ag Ditch Storm Drain
## Storm Drain <2e-16   -          
## Waterway    0.33     <2e-16     
## 
## P value adjustment method: BH
cen1way (conc, cen.ind, Region)
##        grp     N   PctND     KMmean        KMsd   KMmedian
## 1       CV    23   65.22   0.002214   0.0028070     <0.001
## 2      Imp    47   89.36   0.001357   0.0016620     <0.001
## 3   NorCal   199   34.17   0.006918   0.0130700     0.0026
## 4      Sal   115   36.52   0.007557   0.0199800    0.00226
## 5       SM    58   65.52   0.001586   0.0015820     <0.001
## 6    SoCal   278   20.14   0.039800   0.0777600     0.0115
## 7       SV    71   90.14   0.001090   0.0004081     <0.001
## 
##       Oneway Peto-Peto test of CensData: conc   by Factor: Region 
##       Chisq = 247   on 6 degrees of freedom     p = 1.8e-50 
## 
##  Pairwise comparisons using Peto & Peto test 
## 
## data:  CensData and Factor 
## 
##        CV      Imp     NorCal  Sal     SM      SoCal  
## Imp    0.01520 -       -       -       -       -      
## NorCal 0.00652 7.7e-09 -       -       -       -      
## Sal    0.01743 4.9e-08 0.35099 -       -       -      
## SM     0.70252 0.00886 1.4e-06 2.3e-05 -       -      
## SoCal  2.3e-05 1.0e-12 1.2e-11 3.0e-09 1.9e-12 -      
## SV     0.00222 0.84596 3.7e-13 3.5e-12 0.00096 < 2e-16
## 
## P value adjustment method: BH
cen1way (conc, cen.ind, Waterbody.Type)
##                       grp     N   PctND     KMmean      KMsd   KMmedian
## 1   Engineered Conveyance   288   19.79   0.025660   0.06346     0.0056
## 2               Main Stem   468   53.21   0.012780   0.04055     <0.001
## 3        Tributary Stream    35   54.29   0.006376   0.02464     <0.001
## 
##       Oneway Peto-Peto test of CensData: conc   by Factor: Waterbody.Type 
##       Chisq = 96.94   on 2 degrees of freedom     p = 8.91e-22 
## 
##  Pairwise comparisons using Peto & Peto test 
## 
## data:  CensData and Factor 
## 
##                  Engineered Conveyance Main Stem
## Main Stem        < 2e-16               -        
## Tributary Stream 4.2e-06               0.3      
## 
## P value adjustment method: BH
#cencorreg (conc, cen.ind, bif2$Event.Type2, LOG=TRUE, verbose=2)
#cencorreg (conc, cen.ind, as.factor(Site.Type), LOG=TRUE, verbose=2)

#xvar=data.frame(bif2$Event.Type, as.factor(bif$Site.Type))
#cencorreg (conc, cen.ind, xvar, LOG=TRUE, verbose=2)

So, we did not observe a trend overtime for bifenthrin if we look at the entire dataset. However, from the Peto Peto tests, we know that bifenthrin concentrations different by event type, site type, waterbody type and region. Could the trend vary by these four factors? Let’s find out by conducting a seasonal Kendall Test with nondetects.

6.4 Concentration trend by groups

censeaken(year.dec, conc, cen.ind,group=Event.Type, seaplots=TRUE, printstat = T)
## 
##  DATA ANALYZED: conc vs year.dec by Event.Type 
## ----------

##     Season   N     S     tau    pval Intercept      slope
## 1 Nonstorm 549 -7846 -0.0522 0.04208    0.5654 -0.0002798
## ----------

##   Season   N   S     tau   pval Intercept     slope
## 1  Storm 242 236 0.00809 0.8518   -0.1941 0.0001051
## ---------- 
## Seasonal Kendall test and Theil-Sen line

##   reps_R   N  S_SK  tau_SK   pval intercept     slope
## 1   4999 791 -7610 -0.0424 0.0604  -0.18095 9.051e-05
## ----------

censeaken(year.dec, conc, cen.ind,group=Site.Type, seaplots=TRUE)
## 
##  DATA ANALYZED: conc vs year.dec by Site.Type 
## ----------

##     Season  N  S    tau    pval Intercept    slope
## 1 Ag Ditch 97 59 0.0127 0.84096 -0.079212 3.97e-05
## ----------

##        Season   N    S    tau   pval Intercept     slope
## 1 Storm Drain 196 1141 0.0597 0.2142   -1.1599 0.0005791
## ----------

##     Season   N    S    tau    pval Intercept     slope
## 1 Waterway 498 4969 0.0402 0.14632  -0.41155 0.0002043
## ---------- 
## Seasonal Kendall test and Theil-Sen line

##   reps_R   N S_SK tau_SK   pval intercept     slope
## 1   4999 791 6169 0.0418 0.0838  -0.18095 9.051e-05
## ----------

censeaken(year.dec, conc, cen.ind,group=Region, seaplots=TRUE)
## 
##  DATA ANALYZED: conc vs year.dec by Region 
## ----------

##   Season  N  S    tau    pval Intercept    slope
## 1     CV 23 16 0.0632 0.61636   -1.2617 0.000625
## ----------

##   Season  N    S    tau     pval Intercept     slope
## 1    Imp 47 -110 -0.102 0.024291    7.4782 -0.003705
## ----------

##   Season   N     S    tau     pval Intercept      slope
## 1 NorCal 199 -2117 -0.107 0.021206    1.3555 -0.0006701
## ----------

##   Season   N    S     tau    pval Intercept      slope
## 1    Sal 115 -345 -0.0526 0.39192   0.58519 -0.0002888
## ----------

##   Season  N  S    tau    pval Intercept     slope
## 1     SM 58 29 0.0175 0.81984  -0.10728 5.359e-05
## ----------

##   Season   N    S    tau     pval Intercept    slope
## 1  SoCal 278 3840 0.0997 0.012768   -4.0931 0.002033
## ----------

##   Season  N   S     tau    pval Intercept      slope
## 1     SV 71 -30 -0.0121 0.75579   0.35365 -0.0001748
## ---------- 
## Seasonal Kendall test and Theil-Sen line

##   reps_R   N S_SK tau_SK  pval intercept     slope
## 1   4999 791 1283 0.0183 0.499  -0.18095 9.051e-05
## ----------

censeaken(year.dec, conc, cen.ind,group=Waterbody.Type, seaplots=TRUE)
## 
##  DATA ANALYZED: conc vs year.dec by Waterbody.Type 
## ----------

##                  Season   N   S    tau    pval Intercept     slope
## 1 Engineered Conveyance 288 789 0.0191 0.62799   -0.2886 0.0001456
## ----------

##      Season   N    S    tau     pval Intercept     slope
## 1 Main Stem 468 5634 0.0516 0.070044  -0.68272 0.0003386
## ----------

##             Season  N    S    tau     pval Intercept      slope
## 1 Tributary Stream 35 -116 -0.195 0.068404    1.0265 -0.0005082
## ---------- 
## Seasonal Kendall test and Theil-Sen line

##   reps_R   N S_SK tau_SK  pval intercept     slope
## 1   4999 791 6307 0.0417 0.071  -0.18095 9.051e-05
## ----------

The results above show that the concentrations during non-storm events slightly decreased over time (P-value = 0.042), while the concentrations during storm events does not have a trend overtime and concentration stays pretty much at similar levels (slope = 0.0001). Bifenthrin trends also vary among different regions. In Imperial Valley and Northern California, the concentrations have decreased over time. However, bifenthrin concentration increased in Southern California. In Central Valley, Santa Maria and Salinas Valley, no significant trends are observed over time. No different trends are observed among different site and waterbody types.

7 Conclusions

Bifenthrin concentration over the entire State fits a log-normal distribution. About 41% of the data is censored, meaning that the concentration is below level of quantification (LOQ). The maximum LOQ is 0.001 ppb. A total of 791 samples were taken during the years between 2016-2021. The mean concentration is 0.017 ppb while the median is 0.0017 ppb. The 90th percentile concentration is 0.0151 ppb. Concentrations vary by event type (storm vs. non-storm), site type, water body type and region. Concentrations during storm events are significant higher than those at non-storm events. Samples taken from urban storm drains have significantly higher concentration compared to those taken from AG ditches or waterways. Sampling sites at Southern California have the highest concentrations followed by those in Northern California and Salinas Valley. Imperial Valley sites have the lowest concentrations among all the regions.

In general, bifenthrin data throughout the state does not display sharp changes over time. The concentrations during storm events does not have a trend overtime and concentration stays pretty much at similar levels (slope = 0.0001).However, the concentrations during non-storm events are decreasing over time (P-value = 0.042) with a small slope (-0.0002). In Imperial Valley and Northern California, the concentrations have decreased over time. However, bifenthrin concentration increased in Southern California. In Central Valley, Santa Maria and Salinas Valley, no significant trends are observed over time. No different trends are observed among different site and waterbody types.