The example data is organized by monitoring site, region, and date. Different sites and years may have different detection limits.
| Site_ID | Region | Date | Concentration | Detect_Limit |
|---|---|---|---|---|
| 270535501 | 1 | 2009-07-30 | 0.00148 | 0.10 |
| 270535501 | 1 | 2009-09-30 | 0.00064 | 0.10 |
| 270535501 | 1 | 2009-11-30 | 0.24256 | 0.10 |
| 270535501 | 1 | 2009-12-30 | 0.00064 | 0.10 |
| 270535501 | 1 | 2010-01-30 | 0.16001 | 0.10 |
| 270535501 | 1 | 2010-03-30 | 0.19300 | 0.10 |
| 270535502 | 1 | 2009-03-30 | 0.21219 | 0.01 |
| 270535502 | 1 | 2009-07-30 | 0.01113 | 0.01 |
| 270535502 | 1 | 2009-09-30 | 0.00044 | 0.01 |
| 270535502 | 1 | 2009-11-30 | 0.00127 | 0.01 |
| 270535502 | 1 | 2009-12-30 | 0.00613 | 0.01 |
| 270535502 | 1 | 2010-01-30 | 0.02216 | 0.01 |
| 270535502 | 1 | 2010-03-30 | 0.08113 | 0.02 |
For air monitoring data we currently use the EnvStats package to generate annual summary values and confidence intervals. The package provides multiple non-detect methods that will automatically calculate confidence intervals for you. The methods below will help reduce the effects of different detection limits between sites and years.
To begin, add a column indicating if the sample was detected or not. When the concentration is less than the detection limit the Censored column is set to TRUE.
df$Censored <- (df$Concentration) < (df$Detect_Limit)| Site_ID | Region | Date | Concentration | Detect_Limit | Censored |
|---|---|---|---|---|---|
| 270535501 | 1 | 2009-07-30 | 0.00148 | 0.1 | TRUE |
| 270535501 | 1 | 2009-09-30 | 0.00064 | 0.1 | TRUE |
| 270535501 | 1 | 2009-11-30 | 0.24256 | 0.1 | FALSE |
| 270535501 | 1 | 2009-12-30 | 0.00064 | 0.1 | TRUE |
| 270535501 | 1 | 2010-01-30 | 0.16001 | 0.1 | FALSE |
| 270535501 | 1 | 2010-03-30 | 0.19300 | 0.1 | FALSE |
Similar to NADA, EnvStats will assume a detection limit is located at the value of the lowest detected concentration. To force the summary to show the correct detection limits you can substitute the detection limit in for the non-detected values.
# Save the original mean of the raw values
raw_mean <- mean(df$Concentration, na.rm=T)
# Replace non-detects with the detection limit
df$Concentration <- ifelse(df$Censored, df$Detect_Limit, df$Concentration)
Next we can use EnvStats to estimate the mean for the entire region. Because there are multiple detection limits the potential methods we can use are more limited. For multiple detection limits with limited amounts of detections (less than 10), Helsel recommends using Kaplan-Meier. Here’s an example of running Kaplan-Meier in EnvStats using the function enparCensored().
library(EnvStats)
# Kaplan-Meier
# summary for all sites and dates
km_summary <- enparCensored(df$Concentration,
df$Censored,
ci=T,
ci.method = "bootstrap",
n.bootstraps = 3000)
print(km_summary)##
## Results of Distribution Parameter Estimation
## Based on Type I Censored Data
## --------------------------------------------
##
## Assumed Distribution: None
##
## Censoring Side: left
##
## Censoring Level(s): 0.01 0.10
##
## Estimated Parameter(s): mean = 0.07879923
## sd = 0.08648545
## se.mean = 0.01346641
##
## Estimation Method: Kaplan-Meier
##
## Data: df$Concentration
##
## Censoring Variable: df$Censored
##
## Sample Size: 13
##
## Percent Censored: 46.15385%
##
## Confidence Interval for: mean
##
## Assumed Sample Size: 13
##
## Confidence Interval Method: Bootstrap
##
## Number of Bootstraps: 3000
##
## Number of Bootstrap Samples
## With No Censored Values: 1
##
## Number of Times Bootstrap
## Repeated Because Too Few
## Uncensored Observations: 1
##
## Confidence Interval Type: two-sided
##
## Confidence Level: 95%
##
## Confidence Interval: Pct.LCL = 0.03936345
## Pct.UCL = 0.11841510
## BCa.LCL = 0.03888024
## BCa.UCL = 0.11788602
## t.LCL = 0.04240100
## t.UCL = 0.16189737
To pull out the mean value from the results use: $parameters[[1]]
km_mean <- enparCensored(df$Concentration, df$Censored)$parameters[[1]]
paste("The Kaplan-Meier mean is", round(km_mean, 3))## [1] "The Kaplan-Meier mean is 0.079"
Compare with the NADA package
library(NADA)
Nada_KM_Mean <- cenfit(df$Concentration, df$Censored)
print(Nada_KM_Mean)## n n.cen median mean sd
## 13.00000000 6.00000000 0.02216000 0.07919038 0.09448937
When you have larger data sets there are additional methods such as ROS and MLE, which use the the observed distribution of the detected values and applies it to predict the location of non-detect values. These methods usually generate less conservative (lower estimates of the mean), but in most cases are a closer approximation to reality. Below are 2 examples of using the ROS and MLE methods in the EnvStats package.
library(EnvStats)
# ROS
# Summary across all sites and dates
ros_summary <- enormCensored(df$Concentration,
df$Censored,
method = "rROS",
ci = F,
lb.impute = min(df$Detect_Limit) * 0.10,
ub.impute = max(df$Detect_Limit))
# The lower bound imputed limt (lb.impute) sets the minimum assigned value for non-detects. If you know the background concentration for a body of water that could be a good minimum value. The example above uses 10% of the minimum detection limit for the lower bound.
# The upper bound imputed limt (ub.impute) sets the maximum assigned value for non-detects. It's set at the highest detection limit above.
print(ros_summary)##
## Results of Distribution Parameter Estimation
## Based on Type I Censored Data
## --------------------------------------------
##
## Assumed Distribution: Normal
##
## Censoring Side: left
##
## Censoring Level(s): 0.01 0.10
##
## Estimated Parameter(s): mean = 0.07624475
## sd = 0.09239318
##
## Estimation Method: Imputation With Q-Q Regression (rROS)
##
## Plotting Position Method: hirsch-stedinger
##
## Plotting Position Constant: 0.375
##
## Data: df$Concentration
##
## Censoring Variable: df$Censored
##
## Sample Size: 13
##
## Percent Censored: 46.15385%
# MLE
#summary across all sites and dates
mle_summary <- enormCensored(df$Concentration,
df$Censored,
method = "mle",
ci = F)
print(mle_summary)##
## Results of Distribution Parameter Estimation
## Based on Type I Censored Data
## --------------------------------------------
##
## Assumed Distribution: Normal
##
## Censoring Side: left
##
## Censoring Level(s): 0.01 0.10
##
## Estimated Parameter(s): mean = 0.04964891
## sd = 0.12195139
##
## Estimation Method: MLE
##
## Data: df$Concentration
##
## Censoring Variable: df$Censored
##
## Sample Size: 13
##
## Percent Censored: 46.15385%
Compare to substitution of 1/2 the detection limt
sub_half <- df
sub_half$Concentration <- ifelse(df$Censored,
df$Detect_Limit * 0.5,
df$Concentration)
sub_half_DL_Mean <- mean(sub_half$Concentration)
print(sub_half_DL_Mean)## [1] 0.08362923
Compare to random substitution between 0 and the detection limit
set.seed(8*3*2016)
sub_random <- df
sub_random$Concentration <- ifelse(df$Censored,
runif(1, 0, df$Detect_Limit),
df$Concentration)
sub_random_Mean <- mean(sub_random$Concentration)
print(sub_random_Mean)## [1] 0.08726135
When your data has a single detection limit, as for air monitoring data from a single year, annual statistics can be generated using the imputed MLE method below.
# Imputed MLE
#summary across a single site with one detection limit
site1 <- filter(df, Site_ID == 270535501)
mle_summary <- enormCensored(site1$Concentration,
site1$Censored,
method = "impute.w.mle",
lb.impute = min(site1$Detect_Limit) * 0.10,
ci = T,
ci.method = "bootstrap")
#print(mle_summary)Below is a summary table of the non-detect methods used for the region overall. For comparison I’ve also included the mean of the original values before censoring, and the mean when we replace all of the non-detects with the detection limit.
library(dplyr)
summary <- df %>%
summarize('Raw values' = raw_mean,
'Kap-M' = enparCensored(Concentration, Censored)$parameters[[1]],
'ROS' = enormCensored(Concentration,
Censored,
method = "rROS",
lb.impute = min(df$Detect_Limit)*0.10,
ub.impute = max(df$Detect_Limit))$parameters[[1]],
'MLE' = enormCensored(Concentration,
Censored,
method = "mle")$parameters[[1]],
'Sub 1/2 DL' = sub_half_DL_Mean,
'Sub random' = sub_random_Mean,
'Sub DL' = mean(df$Concentration)
)
knitr::kable(summary, digits = 3)| Raw values | Kap-M | ROS | MLE | Sub 1/2 DL | Sub random | Sub DL |
|---|---|---|---|---|---|---|
| 0.072 | 0.079 | 0.076 | 0.05 | 0.084 | 0.087 | 0.096 |
DENNIS HELSEL - http://annhyg.oxfordjournals.org/content/54/3/257.full
RONALD C. ANTWEILER - http://pubs.acs.org/doi/pdf/10.1021/es071301c