This project aims to establish methods for statistical analysis with non-detects. Pesticide monitoring data often contain substantial amount of non-detects, which are concentrations that are below level of quantification. Traditional methods used substitutions or treat non-detects as zeros. These methods are problematic and could result in many problems such as failing to detect trends when there is or detecting trends that are artificially introduced by substitution. Helsel (2012) developed statistical methods that can perform many kinds of statistical analysis without substituting non-detects. These methods are included in a R package called NADA. Later in 2021, Helsel developed version 2 of the NADA package (NADA2). This document demonstrates a few examples of using the NADA and NADA2 R packages for analyzing pesticide monitoring data in SURF.
bif = read.csv("D:\\Statistical_Learning\\NADA_SURF\\RawData\\bifen.csv", colClasses =c(rep("character",13), rep("numeric",3), "character","numeric","character"))
names(bif)
## [1] "Sample.Type" "Sample.Matrix" "Sample.Date"
## [4] "Catchment.Category" "Watershed.Category" "Event.Type"
## [7] "Site.Type" "Waterbody.Type" "Region"
## [10] "County" "Watershed" "Site.ID"
## [13] "Analyte.Name" "Result" "MDL"
## [16] "RL" "Qualifier" "Minimum.BM"
## [19] "Minimum.BM.Endpoint"
head(bif)
## Sample.Type Sample.Matrix Sample.Date Catchment.Category Watershed.Category
## 1 Norm Water 3/22/2016 AG AG
## 2 Norm Water 3/22/2016 AG AG
## 3 Norm Water 4/11/2016 AG AG
## 4 Norm Water 4/11/2016 MIX MIX
## 5 Norm Water 4/11/2016 AG UND
## 6 Norm Water 6/13/2016 MIX MIX
## Event.Type Site.Type Waterbody.Type Region County Watershed
## 1 Nonstorm Ag Ditch Engineered Conveyance Imp Imperial Alamo River
## 2 Nonstorm Ag Ditch Engineered Conveyance Imp Imperial New River
## 3 Nonstorm Ag Ditch Engineered Conveyance Sal Monterey Salinas River
## 4 Nonstorm Ag Ditch Engineered Conveyance Sal Monterey Tembladero Slough
## 5 Nonstorm Ag Ditch Engineered Conveyance Sal Monterey Tembladero Slough
## 6 Nonstorm Ag Ditch Engineered Conveyance Sal Monterey Tembladero Slough
## Site.ID Analyte.Name Result MDL RL Qualifier Minimum.BM
## 1 Imp_Holtville Bifenthrin -99.00000 NA 0.001 ND 0.0013
## 2 Imp_Rice3 Bifenthrin -99.00000 NA 0.001 ND 0.0013
## 3 Sal_Chualar Bifenthrin -99.00000 NA 0.001 ND 0.0013
## 4 Sal_SanJon Bifenthrin 0.00284 NA 0.001 D 0.0013
## 5 Sal_Hartnell Bifenthrin 0.00791 NA 0.001 D 0.0013
## 6 Sal_SanJon Bifenthrin 0.00125 NA 0.001 D 0.0013
## Minimum.BM.Endpoint
## 1 IC
## 2 IC
## 3 IC
## 4 IC
## 5 IC
## 6 IC
#summary(bif$Sample.Date)
#summary(bif$Result)
#summary(bif$RL)
summary(bif)
## Sample.Type Sample.Matrix Sample.Date Catchment.Category
## Length:791 Length:791 Length:791 Length:791
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Watershed.Category Event.Type Site.Type Waterbody.Type
## Length:791 Length:791 Length:791 Length:791
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Region County Watershed Site.ID
## Length:791 Length:791 Length:791 Length:791
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Analyte.Name Result MDL RL
## Length:791 Min. :-99.0000 Min. :0.00091 Min. :0.001
## Class :character 1st Qu.:-99.0000 1st Qu.:0.00091 1st Qu.:0.001
## Mode :character Median : 0.0017 Median :0.00091 Median :0.001
## Mean :-40.6596 Mean :0.00094 Mean :0.001
## 3rd Qu.: 0.0106 3rd Qu.:0.00099 3rd Qu.:0.001
## Max. : 0.6490 Max. :0.00099 Max. :0.001
## NA's :164
## Qualifier Minimum.BM Minimum.BM.Endpoint
## Length:791 Min. :0.0013 Length:791
## Class :character 1st Qu.:0.0013 Class :character
## Mode :character Median :0.0013 Mode :character
## Mean :0.0013
## 3rd Qu.:0.0013
## Max. :0.0013
##
summary(bif$Result)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -99.0000 -99.0000 0.0017 -40.6596 0.0106 0.6490
summary(bif$RL)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001 0.001 0.001 0.001 0.001 0.001
bif=bif %>%
mutate(conc= case_when (Qualifier== 'D' ~ Result,
Qualifier== 'ND' ~ RL)
)
## create a new field for indicating whether the Conc is censored or not
bif=bif %>%
mutate(cen.ind= case_when (Qualifier== 'D' ~ 0,
Qualifier== 'ND' ~ 1)
)
## Convert sampling data to decimal year
class(bif$Sample.Date)
## [1] "character"
bif$Sample.Date2=as.Date(bif$Sample.Date, format="%m/%d/%Y")
bif=bif %>%
mutate(year.dec=decimal_date(Sample.Date2))
summary(bif$year.dec)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2016 2018 2019 2019 2020 2021
summary(bif$RL)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001 0.001 0.001 0.001 0.001 0.001
summary(bif$conc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00100 0.00100 0.00170 0.01718 0.01060 0.64900
bif2= bif[,c('conc', 'cen.ind', 'year.dec', 'Event.Type', 'Site.Type', 'Waterbody.Type', 'Region')]
bif2=bif2 %>%
mutate(cen.logic=case_when (cen.ind== 1 ~ TRUE,
cen.ind== 0 ~ FALSE)
)
bif2$year=year(bif$Sample.Date2)
Explore the data set and get a summary statistics for bifenthrin concentraiton
attach(bif2)
cboxplot(conc, cen.ind)
#fit=cenfit(conc, cen.logic, as.factor(year))
#fit
#plot(fit)
cfit(conc, cen.logic)
##
## Output for conc 95% Confidence Intervals
## Statistics:
## N PctND Conf KMmean KMsd KMmedian LCLmean UCLmean LCLmedian UCLmedian
## 791 41.09 95 0.01718 0.05009 0.0017 0.01382 0.02055 0.0014 0.00238
##
## Quantiles: Q10 Q25 Q50 Q75 Q90
## <0.001 <0.001 0.0017 0.0106 0.0451
##
enparCensored(conc,cen.ind, ci=TRUE, ci.method="bootstrap", n.bootstraps = 5000)
## $distribution
## [1] "None"
##
## $sample.size
## [1] 791
##
## $censoring.side
## [1] "left"
##
## $censoring.levels
## [1] 0.001
##
## $percent.censored
## [1] 41.08723
##
## $parameters
## mean sd se.mean
## 0.01718497 0.05008888 0.00171501
##
## $n.param.est
## [1] 2
##
## $method
## [1] "Kaplan-Meier"
##
## $data.name
## [1] "conc"
##
## $censoring.name
## [1] "cen.ind"
##
## $bad.obs
## [1] 0
##
## $interval
## $name
## [1] "Confidence"
##
## $parameter
## [1] "mean"
##
## $limits
## Pct.LCL Pct.UCL BCa.LCL BCa.UCL t.LCL t.UCL
## 0.01441927 0.02029467 0.01441719 0.02028770 0.01456480 0.02068234
##
## $type
## [1] "two-sided"
##
## $method
## [1] "Bootstrap"
##
## $conf.level
## [1] 0.95
##
## $sample.size
## [1] 791
##
## $n.bootstraps
## [1] 5000
##
## $too.few.obs.count
## [1] 0
##
## $no.cen.obs.count
## [1] 0
##
## attr(,"class")
## [1] "intervalEstimateCensored"
##
## attr(,"class")
## [1] "estimateCensored"
First, check to see if the concentration is normally distributed. From the plots below, we can see that the concentration fits a log-normal distribution (P=0.988). So for the following analysis, we can either use parametric (with log transformation) or non-parametric methods.
cenCompareQQ(conc, cen.ind)
## Best of the three distributions is the lognormal
The parametric MLE test shows that bifenthrin Concentration is positively correlated with time but it’s not significant with P-value = 0.507. So no significant trend observed. In addition, the residual plot from the regression shows that it’s not normally distributed indicating issues with the parametric approach. Next, let’s try the non-parametric test using the Akritas-Theil-Sen for censored data. The ATS line has a Kendall’s tau = 0.0181, and p-value = 0.42928. The result is consistent with the MLE test that the trend is not significant.
# MLE simple regression
cencorreg (conc, cen.ind, year.dec, LOG=TRUE, verbose=2)
## Likelihood R = 0.02361 AIC = 2646.908
## Rescaled Likelihood R = 0.02404 BIC = 2659.932
## McFaddens R = 0.01292
##
## Call:
## survreg(formula = "log(conc)", data = "year.dec", dist = "gaussian")
##
## Coefficients:
## (Intercept) year.dec
## -89.60682413 0.04125878
##
## Scale= 2.365251
##
## Loglik(model)= -1320 Loglik(intercept only)= -1320.2
## Chisq= 0.44 on 1 degrees of freedom, p= 0.507
## n= 791
# Akritas-Theil-Sen
ATS (conc, cen.ind, year.dec)
## Akritas-Theil-Sen line for censored data
##
## ln(conc) = -98.3527 + 0.0447 * year.dec
## Kendall's tau = 0.0181 p-value = 0.42928
##
Next, let’s see if concentration trend is associated with other factors such as event type, site type, region and waterbody type. We use the Peto Peto test to test the difference between groups (e.g. event type). The boxplot shows that there is difference in bifenthrin concentration between storm and non-storm events, with storm events producing much higher concentration. There’s also significant differences among different site types, waterbody types and region. The Peto Peto tests also confirms that the difference is statistically significant (P-value < 0.0001).
cboxplot(conc, cen.ind, Event.Type, show = TRUE, LOG=TRUE)
cboxplot(conc, cen.ind, Site.Type, show = TRUE, LOG=TRUE)
cboxplot(conc, cen.ind, Region, show = TRUE, LOG=TRUE)
cboxplot(conc, cen.ind, Waterbody.Type, show = TRUE, LOG=TRUE)
cen1way (conc, cen.ind, Event.Type)
## grp N PctND KMmean KMsd KMmedian
## 1 Nonstorm 549 57.380 0.005469 0.03059 <0.001
## 2 Storm 242 4.132 0.043760 0.07113 0.0181
##
## Oneway Peto-Peto test of CensData: conc by Factor: Event.Type
## Chisq = 493 on 1 degrees of freedom p = 3.17e-109
cen1way (conc, cen.ind, Site.Type)
## grp N PctND KMmean KMsd KMmedian
## 1 Ag Ditch 97 55.670 0.006526 0.02074 <0.001
## 2 Storm Drain 196 4.082 0.034500 0.07390 0.0085
## 3 Waterway 498 52.810 0.012450 0.03986 <0.001
##
## Oneway Peto-Peto test of CensData: conc by Factor: Site.Type
## Chisq = 220.3 on 2 degrees of freedom p = 1.42e-48
##
## Pairwise comparisons using Peto & Peto test
##
## data: CensData and Factor
##
## Ag Ditch Storm Drain
## Storm Drain <2e-16 -
## Waterway 0.33 <2e-16
##
## P value adjustment method: BH
cen1way (conc, cen.ind, Region)
## grp N PctND KMmean KMsd KMmedian
## 1 CV 23 65.22 0.002214 0.0028070 <0.001
## 2 Imp 47 89.36 0.001357 0.0016620 <0.001
## 3 NorCal 199 34.17 0.006918 0.0130700 0.0026
## 4 Sal 115 36.52 0.007557 0.0199800 0.00226
## 5 SM 58 65.52 0.001586 0.0015820 <0.001
## 6 SoCal 278 20.14 0.039800 0.0777600 0.0115
## 7 SV 71 90.14 0.001090 0.0004081 <0.001
##
## Oneway Peto-Peto test of CensData: conc by Factor: Region
## Chisq = 247 on 6 degrees of freedom p = 1.8e-50
##
## Pairwise comparisons using Peto & Peto test
##
## data: CensData and Factor
##
## CV Imp NorCal Sal SM SoCal
## Imp 0.01520 - - - - -
## NorCal 0.00652 7.7e-09 - - - -
## Sal 0.01743 4.9e-08 0.35099 - - -
## SM 0.70252 0.00886 1.4e-06 2.3e-05 - -
## SoCal 2.3e-05 1.0e-12 1.2e-11 3.0e-09 1.9e-12 -
## SV 0.00222 0.84596 3.7e-13 3.5e-12 0.00096 < 2e-16
##
## P value adjustment method: BH
cen1way (conc, cen.ind, Waterbody.Type)
## grp N PctND KMmean KMsd KMmedian
## 1 Engineered Conveyance 288 19.79 0.025660 0.06346 0.0056
## 2 Main Stem 468 53.21 0.012780 0.04055 <0.001
## 3 Tributary Stream 35 54.29 0.006376 0.02464 <0.001
##
## Oneway Peto-Peto test of CensData: conc by Factor: Waterbody.Type
## Chisq = 96.94 on 2 degrees of freedom p = 8.91e-22
##
## Pairwise comparisons using Peto & Peto test
##
## data: CensData and Factor
##
## Engineered Conveyance Main Stem
## Main Stem < 2e-16 -
## Tributary Stream 4.2e-06 0.3
##
## P value adjustment method: BH
#cencorreg (conc, cen.ind, bif2$Event.Type2, LOG=TRUE, verbose=2)
#cencorreg (conc, cen.ind, as.factor(Site.Type), LOG=TRUE, verbose=2)
#xvar=data.frame(bif2$Event.Type, as.factor(bif$Site.Type))
#cencorreg (conc, cen.ind, xvar, LOG=TRUE, verbose=2)
So, we did not observe a trend overtime for bifenthrin if we look at the entire dataset. However, from the Peto Peto tests, we know that bifenthrin concentrations different by event type, site type, waterbody type and region. Could the trend vary by these four factors? Let’s find out by conducting a seasonal Kendall Test with nondetects.
censeaken(year.dec, conc, cen.ind,group=Event.Type, seaplots=TRUE, printstat = T)
##
## DATA ANALYZED: conc vs year.dec by Event.Type
## ----------
## Season N S tau pval Intercept slope
## 1 Nonstorm 549 -7846 -0.0522 0.04208 0.5654 -0.0002798
## ----------
## Season N S tau pval Intercept slope
## 1 Storm 242 236 0.00809 0.8518 -0.1941 0.0001051
## ----------
## Seasonal Kendall test and Theil-Sen line
## reps_R N S_SK tau_SK pval intercept slope
## 1 4999 791 -7610 -0.0424 0.0604 -0.18095 9.051e-05
## ----------
censeaken(year.dec, conc, cen.ind,group=Site.Type, seaplots=TRUE)
##
## DATA ANALYZED: conc vs year.dec by Site.Type
## ----------
## Season N S tau pval Intercept slope
## 1 Ag Ditch 97 59 0.0127 0.84096 -0.079212 3.97e-05
## ----------
## Season N S tau pval Intercept slope
## 1 Storm Drain 196 1141 0.0597 0.2142 -1.1599 0.0005791
## ----------
## Season N S tau pval Intercept slope
## 1 Waterway 498 4969 0.0402 0.14632 -0.41155 0.0002043
## ----------
## Seasonal Kendall test and Theil-Sen line
## reps_R N S_SK tau_SK pval intercept slope
## 1 4999 791 6169 0.0418 0.0838 -0.18095 9.051e-05
## ----------
censeaken(year.dec, conc, cen.ind,group=Region, seaplots=TRUE)
##
## DATA ANALYZED: conc vs year.dec by Region
## ----------
## Season N S tau pval Intercept slope
## 1 CV 23 16 0.0632 0.61636 -1.2617 0.000625
## ----------
## Season N S tau pval Intercept slope
## 1 Imp 47 -110 -0.102 0.024291 7.4782 -0.003705
## ----------
## Season N S tau pval Intercept slope
## 1 NorCal 199 -2117 -0.107 0.021206 1.3555 -0.0006701
## ----------
## Season N S tau pval Intercept slope
## 1 Sal 115 -345 -0.0526 0.39192 0.58519 -0.0002888
## ----------
## Season N S tau pval Intercept slope
## 1 SM 58 29 0.0175 0.81984 -0.10728 5.359e-05
## ----------
## Season N S tau pval Intercept slope
## 1 SoCal 278 3840 0.0997 0.012768 -4.0931 0.002033
## ----------
## Season N S tau pval Intercept slope
## 1 SV 71 -30 -0.0121 0.75579 0.35365 -0.0001748
## ----------
## Seasonal Kendall test and Theil-Sen line
## reps_R N S_SK tau_SK pval intercept slope
## 1 4999 791 1283 0.0183 0.499 -0.18095 9.051e-05
## ----------
censeaken(year.dec, conc, cen.ind,group=Waterbody.Type, seaplots=TRUE)
##
## DATA ANALYZED: conc vs year.dec by Waterbody.Type
## ----------
## Season N S tau pval Intercept slope
## 1 Engineered Conveyance 288 789 0.0191 0.62799 -0.2886 0.0001456
## ----------
## Season N S tau pval Intercept slope
## 1 Main Stem 468 5634 0.0516 0.070044 -0.68272 0.0003386
## ----------
## Season N S tau pval Intercept slope
## 1 Tributary Stream 35 -116 -0.195 0.068404 1.0265 -0.0005082
## ----------
## Seasonal Kendall test and Theil-Sen line
## reps_R N S_SK tau_SK pval intercept slope
## 1 4999 791 6307 0.0417 0.071 -0.18095 9.051e-05
## ----------
The results above show that the concentrations during non-storm events slightly decreased over time (P-value = 0.042), while the concentrations during storm events does not have a trend overtime and concentration stays pretty much at similar levels (slope = 0.0001). Bifenthrin trends also vary among different regions. In Imperial Valley and Northern California, the concentrations have decreased over time. However, bifenthrin concentration increased in Southern California. In Central Valley, Santa Maria and Salinas Valley, no significant trends are observed over time. No different trends are observed among different site and waterbody types.
Bifenthrin concentration over the entire State fits a log-normal distribution. About 41% of the data is censored, meaning that the concentration is below level of quantification (LOQ). The maximum LOQ is 0.001 ppb. A total of 791 samples were taken during the years between 2016-2021. The mean concentration is 0.017 ppb while the median is 0.0017 ppb. The 90th percentile concentration is 0.0151 ppb. Concentrations vary by event type (storm vs. non-storm), site type, water body type and region. Concentrations during storm events are significant higher than those at non-storm events. Samples taken from urban storm drains have significantly higher concentration compared to those taken from AG ditches or waterways. Sampling sites at Southern California have the highest concentrations followed by those in Northern California and Salinas Valley. Imperial Valley sites have the lowest concentrations among all the regions.
In general, bifenthrin data throughout the state does not display sharp changes over time. The concentrations during storm events does not have a trend overtime and concentration stays pretty much at similar levels (slope = 0.0001).However, the concentrations during non-storm events are decreasing over time (P-value = 0.042) with a small slope (-0.0002). In Imperial Valley and Northern California, the concentrations have decreased over time. However, bifenthrin concentration increased in Southern California. In Central Valley, Santa Maria and Salinas Valley, no significant trends are observed over time. No different trends are observed among different site and waterbody types.