First, I imported two dataframes to R. The first was the complete WQI dataset containing data from 2008-2017 (n=8118 observations) for all parameters that the WQI scoring is based on (wqi_df), i.e. the ‘All Data’ tab in the excel WQI calculator. Here is a view of the first couple columns.
| StationID | StationName | AliasID | Date | Aluminum_mg_l | Chloride_mg_l | Iron_mg_l | Nitrate_mg_l | Phosphorus_mg_l | Sodium_mg_l | Sulfate_mg_l | TOC_mg_l | Manganese_mg_l |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 474 | CHEM002.5-4176 | CHEM 2.5 | 9/14/2010 | 0.200 | 69.0 | 0.082 | 0.66 | 0.102 | 44.8 | 21.0 | 2.83 | 0.022 |
| 1919 | HLDN000.0-00001 | Holden Ck US | 5/1/2012 | 0.160 | 7.2 | 0.230 | 0.10 | 0.026 | 5.4 | 9.2 | 3.10 | 0.012 |
| 1920 | HLDN000.0-00002 | Holden Ck DS | 5/1/2012 | 0.160 | 7.2 | 0.230 | 0.10 | 0.026 | 5.4 | 9.2 | 3.10 | 0.012 |
| 1931 | COWNR00.0-00004 | Cowanesque R US | 5/31/2012 | 0.056 | 10.0 | 0.120 | 0.19 | 0.021 | 6.9 | 10.0 | 2.80 | 0.027 |
| 1932 | COWNR00.0-00005 | Cowanesque R DS | 5/31/2012 | 0.056 | 10.0 | 0.120 | 0.19 | 0.021 | 6.9 | 10.0 | 2.80 | 0.027 |
| 466 | CNST038.7-4277 | CNST 38.7 | 6/19/2012 | 0.072 | 55.2 | 0.190 | 2.00 | 0.200 | 27.2 | 27.9 | 3.50 | 0.035 |
I then imported a second dataframe of sample data from the ARS Program Refinement effort (n= 120) that was scored previously using the Excel scorer (wqi_data_full). Note - the dataframe has the same 9 water quality parameters. The columns are named identically to wqi_df, but they don’t have to be. The cfm column indicates flow (cfs) per square mile. The TSS column is generated from random numbers. These two columns are used for the caution flags based on predetermined thresholds. Notice there are NA values, that will not present a problem.
| Project | unique_id_chrono | Date | TSS_mg_l | cfm | Aluminum_mg_l | Chloride_mg_l | Iron_mg_l | Nitrate_mg_l | Phosphorus_mg_l | Sodium_mg_l | Sulfate_mg_l | TOC_mg_l | Manganese_mg_l |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARS III | BALD_01_2 | 2012-05-07 | 40.04289 | 0.9762082 | 0.095 | 4.20 | 0.220 | 0.11 | 0.029 | 3.100 | 9.5 | 0.5 | 0.035 |
| ARS III | BALD_01_1 | 2010-07-14 | 137.04372 | NA | NA | 7.53 | 0.319 | NA | NA | 5.778 | 13.0 | NA | 0.049 |
| ARS III | BOWM_01_3 | 2016-09-19 | 183.37515 | 0.9475834 | 0.110 | 8.90 | 0.180 | 0.39 | 0.023 | 5.600 | 6.2 | 1.8 | 0.031 |
| ARS III | BOWM_01_2 | 2012-06-06 | 56.87989 | 1.1008200 | 0.087 | 4.70 | 0.120 | 0.25 | 0.021 | 3.200 | 6.6 | 1.5 | 0.010 |
| ARS III | BOWM_01_1 | 2008-08-14 | 20.93003 | 0.2611056 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| ARS III | BOWM_02_2 | 2018-05-09 | 140.21149 | 1.5598954 | 0.050 | 7.00 | 0.063 | 0.21 | 0.025 | 4.300 | 6.1 | 1.6 | 0.010 |
I then created Empirical Cumulative Distribution Functions using ecdf() (https://www.rdocumentation.org/packages/stats/versions/3.6.1/topics/ecdf) for each of the 9 parameters used in the SRB WQI. The example below shows the code required to create and ecdf function for aluminum.
Fn_al <- ecdf(wqi_df$Aluminum_mg_l)
Fn_al(0.05)
## [1] 0.2452598
Therefore, I had to manually adjust percentiles of detection limits to = 0 later on in the code. This was the biggest hang-up. After this step was completed, the code worked very similar to the scorer in excel.
wqi_scores <-
wqi_data_full %>%
mutate( # calculate percentiles
#Metals
Al_p = round(Fn_al(Aluminum_mg_l), 3),
Fe_p = round(Fn_fe(Iron_mg_l), 3),
Mn_p = round(Fn_mn(Manganese_mg_l), 3),
#Nutrients
NO3_p = round(Fn_no3(Nitrate_mg_l), 3),
P_p = round(Fn_p(Phosphorus_mg_l), 3),
TOC_p = round(Fn_toc(TOC_mg_l), 3),
#Development
Cl_p = round(Fn_cl(Chloride_mg_l), 3),
Na_p = round(Fn_na(Sodium_mg_l), 3),
SO4_p = round(Fn_so4(Sulfate_mg_l), 3)) %>%
mutate(# adjust percentiles to = 0
#Metals
Al_p = replace(Al_p, Al_p == 0.245, 0),
Fe_p = replace(Fe_p, Fe_p == 0.009, 0),
Mn_p = replace(Mn_p, Mn_p == 0.016, 0),
#Nutrients
P_p = replace(P_p, P_p == 0.113, 0),
TOC_p = replace(TOC_p, TOC_p == 0.015, 0),
#Development
Cl_p = replace(Cl_p, Cl_p == 0.017, 0),
SO4_p = replace(SO4_p, SO4_p == 0.001, 0)) %>%
mutate( #QC check to make sure categories are not scored if > 1 NA
Metals_cat_NA_check = rowSums(is.na(cbind(Al_p, Fe_p, Mn_p))),
Nutrients_cat_NA_check = rowSums(is.na(cbind(NO3_p, P_p, TOC_p))),
Development_cat_NA_check = rowSums(is.na(cbind(Cl_p, Na_p, SO4_p)))
) %>%
mutate(#Calculate category scores and include case_when statements for acute thresholds
# to 0 out scores
Metals_cat = case_when(
Aluminum_mg_l >= 0.75 ~ 0,
Iron_mg_l >= 1.5 ~ 0,
Manganese_mg_l >= 1 ~ 0,
rowSums(is.na(cbind(Al_p, Fe_p, Mn_p))) > 1 ~ NA_real_, # Don't score if > 1 NA
TRUE ~ (round(1 - (rowMeans(cbind(Al_p, Fe_p, Mn_p), na.rm = T)), 3))*100),
Nutrients_cat = case_when(
Nitrate_mg_l >= 10 ~ 0,
rowSums(is.na(cbind(NO3_p, P_p, TOC_p))) > 1 ~ NA_real_, # Don't score if > 1 NA
TRUE ~ (round(1 - (rowMeans(cbind(NO3_p, P_p, TOC_p), na.rm = T)), 3))*100),
Development_cat = case_when(
Chloride_mg_l >= 250 ~ 0,
Sulfate_mg_l >= 250 ~ 0,
rowSums(is.na(cbind(Cl_p, Na_p, SO4_p))) > 1 ~ NA_real_, # Don't score if > 1 NA
TRUE ~ (round(1 - (rowMeans(cbind(Cl_p, Na_p, SO4_p), na.rm = T)), 3))*100)
) %>%
mutate( #Calculate SRB WQI
SRB_WQI = round(rowMeans(cbind(Metals_cat, Nutrients_cat, Development_cat)), 1)
#, na.rm = T - since not included, if one or more Categories are NA, SRB_WQI will = 'NaN'
) %>%
mutate( # Break SRB WQI scores into classes
Classification = cut(SRB_WQI, breaks=c(-Inf, 31, 42.99, 62, 84.99, Inf),
labels=c('Very Poor', 'Poor', 'Fair', 'Good', 'Excellent'))
) %>%
mutate( #Flow and TSS flags
TSS_flag = ifelse(TSS_mg_l >= 80, 'CAUTION', ''),
Flow_flag = ifelse(cfm >= 3.5, 'CAUTION', ''))
After the code was written to calculate category and overall SRB WQI scores, it was time to QC the output by comparing it to the excel scoring output. The results of the QC check were good! All SRB WQI scores were <= 2.09 points different, most were extremely close. The two observations that were > 2.09 points off (3.5 and 5.9) were due to issues addressed below that were not a result of errors in the code.
The majority of the discrepencies were due to the fact that in the excel scorer, we could still generate a category score if > 1 parameters were NA. However, they were dealt with later on in the workflow and the overall SRB WQI score could not be generated. In this R scorer, the category scores were not calculated if > 1 parameters were NA. This issue was responsible for the few large discrepencies in category scores.
For instance, in the metals category in the above figure, there were 3 sites that had deltas of 17.1, 52.9, and 83. The larger of the two deltas were due to the fact that there were 2 NAs in the metals category for these sites. The excel scorer allowed the metals category to be scored, the R scorer did not. The site with a delta of 17.1 had a Mn concentration of 1.0 mg/L. This was our threshold for zeroing out the metals category. I coded the zero out as Mn >= 1 in R; in excel it was coded as > 1 mg/L. This was responsible for the delta and can be changed if necessary. The remainder of sites (120 total) had deltas <= 2.77.
In the nutrients category, there were 6 sites that had deltas > 5 (range 5.1 - 100). These were all cases where there were 2 NAs in the nutrient parameters and excel allowed a category score but R did not. The rest of the deltas were <= 2.83 points
The development category was extremely consistent, with all deltas <= 0.9.
The deltas in the overall SRB WQI score were <= 2.09 for all sites. The only exception was the site where Mn was 1.0 mg/L. The metals category got zeroed out in R (WQI = 16.7) but did not in excel (WQI = 22.6). Speaking generally, a difference of ~ 2 WQI points appears to be an acceptable level of variation, because the scoring classifications (e.g. good, fair, poor) will be unaffected unless the score is right on the boundary. This is especially true when considering the application for this code, where we will average multiple samples across multiple years to determine a site’s SRB WQI score.
Site-level info, including but not-limited to: station name, alias id, project, date, time, ecoregion, drainage area, lat, long, etc…
Flag/caution info: TSS concentration and flow (can calculate cfm if DA and flow included above)
At a minimum, concentrations of Metals (Al, Fe, Mn), Nutrients (NO3, P, TOC), and Development (Cl, Na, SO4). Other parameters can be attached but will not factor into the WQI score
The beauty of calculating WQI scores in R with this code is that it will keep the original dataframe intact with all of the site-level info and additional information, but will attach the following output:
These 18 columns will be attached to the original dataframe, and can be exported to excel, used in GIS, mapped in R, summarized by site, etc…
Here is an example of what the output will look like (just a subset of columns)
| unique_id_chrono | Al_p | Fe_p | Mn_p | NO3_p | P_p | TOC_p | Cl_p | Na_p | SO4_p | Metals_cat | Nutrients_cat | Development_cat | SRB_WQI | Classification | TSS_flag | Flow_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BALD_01_2 | 0.454 | 0.626 | 0.589 | 0.142 | 0.518 | 0.000 | 0.224 | 0.286 | 0.418 | 44.4 | 78.0 | 69.1 | 63.8 | Good | ||
| BALD_01_1 | NA | 0.737 | 0.675 | NA | NA | NA | 0.346 | 0.465 | 0.587 | 29.4 | NA | 53.4 | NA | NA | CAUTION | NA |
| BOWM_01_3 | 0.510 | 0.554 | 0.554 | 0.442 | 0.418 | 0.372 | 0.390 | 0.459 | 0.140 | 46.1 | 58.9 | 67.0 | 57.3 | Fair | CAUTION | |
| BOWM_01_2 | 0.427 | 0.409 | 0.202 | 0.317 | 0.382 | 0.260 | 0.244 | 0.294 | 0.185 | 65.4 | 68.0 | 75.9 | 69.8 | Good | ||
| BOWM_01_1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ||
| BOWM_02_2 | 0.000 | 0.198 | 0.202 | 0.266 | 0.453 | 0.299 | 0.331 | 0.374 | 0.129 | 86.7 | 66.1 | 72.2 | 75.0 | Good | CAUTION |