SRB WQI Scoring in R

The current MS Excel WQI calculator does the job, but is slow and labor intensive. The objective here is to execute the same scoring paradigm in R to speed up processing time. This document will cover:

the coding necessary to complete batch WQI scoring in R,
a QC check of R WQI scoring compared to excel, and
the data requirements for future batch scoring.

First, I imported two dataframes to R. The first was the complete WQI dataset containing data from 2008-2017 (n=8118 observations) for all parameters that the WQI scoring is based on (wqi_df), i.e. the ‘All Data’ tab in the excel WQI calculator. Here is a view of the first couple columns.

StationID	StationName	AliasID	Date	Aluminum_mg_l	Chloride_mg_l	Iron_mg_l	Nitrate_mg_l	Phosphorus_mg_l	Sodium_mg_l	Sulfate_mg_l	TOC_mg_l	Manganese_mg_l
474	CHEM002.5-4176	CHEM 2.5	9/14/2010	0.200	69.0	0.082	0.66	0.102	44.8	21.0	2.83	0.022
1919	HLDN000.0-00001	Holden Ck US	5/1/2012	0.160	7.2	0.230	0.10	0.026	5.4	9.2	3.10	0.012
1920	HLDN000.0-00002	Holden Ck DS	5/1/2012	0.160	7.2	0.230	0.10	0.026	5.4	9.2	3.10	0.012
1931	COWNR00.0-00004	Cowanesque R US	5/31/2012	0.056	10.0	0.120	0.19	0.021	6.9	10.0	2.80	0.027
1932	COWNR00.0-00005	Cowanesque R DS	5/31/2012	0.056	10.0	0.120	0.19	0.021	6.9	10.0	2.80	0.027
466	CNST038.7-4277	CNST 38.7	6/19/2012	0.072	55.2	0.190	2.00	0.200	27.2	27.9	3.50	0.035

I then imported a second dataframe of sample data from the ARS Program Refinement effort (n= 120) that was scored previously using the Excel scorer (wqi_data_full). Note - the dataframe has the same 9 water quality parameters. The columns are named identically to wqi_df, but they don’t have to be. The cfm column indicates flow (cfs) per square mile. The TSS column is generated from random numbers. These two columns are used for the caution flags based on predetermined thresholds. Notice there are NA values, that will not present a problem.

Project	unique_id_chrono	Date	TSS_mg_l	cfm	Aluminum_mg_l	Chloride_mg_l	Iron_mg_l	Nitrate_mg_l	Phosphorus_mg_l	Sodium_mg_l	Sulfate_mg_l	TOC_mg_l	Manganese_mg_l
ARS III	BALD_01_2	2012-05-07	40.04289	0.9762082	0.095	4.20	0.220	0.11	0.029	3.100	9.5	0.5	0.035
ARS III	BALD_01_1	2010-07-14	137.04372	NA	NA	7.53	0.319	NA	NA	5.778	13.0	NA	0.049
ARS III	BOWM_01_3	2016-09-19	183.37515	0.9475834	0.110	8.90	0.180	0.39	0.023	5.600	6.2	1.8	0.031
ARS III	BOWM_01_2	2012-06-06	56.87989	1.1008200	0.087	4.70	0.120	0.25	0.021	3.200	6.6	1.5	0.010
ARS III	BOWM_01_1	2008-08-14	20.93003	0.2611056	NA	NA	NA	NA	NA	NA	NA	NA	NA
ARS III	BOWM_02_2	2018-05-09	140.21149	1.5598954	0.050	7.00	0.063	0.21	0.025	4.300	6.1	1.6	0.010

Code necessary to complete batch WQI scoring in R

I then created Empirical Cumulative Distribution Functions using ecdf() (https://www.rdocumentation.org/packages/stats/versions/3.6.1/topics/ecdf) for each of the 9 parameters used in the SRB WQI. The example below shows the code required to create and ecdf function for aluminum.

Fn_al <- ecdf(wqi_df$Aluminum_mg_l)

This is where things got a little wild. The ecdf function in R gives a percentile of a value, which is defined as ‘number of values less than or equal a certain value’. This complicated things because when entering the detection limit (0.05 mg/L) for Al, the percentile returned is NOT 0 as it is using the =PERCENTRANK.INC function in excel

Fn_al(0.05)

## [1] 0.2452598

Therefore, I had to manually adjust percentiles of detection limits to = 0 later on in the code. This was the biggest hang-up. After this step was completed, the code worked very similar to the scorer in excel.

Percentile distributions for each parameter

After the ecdf functions were completed for each of the 9 WQI parameters, I used the following code to calculate scores.

The first mutate() function calculates percentiles for each parameter using the previously discussed ecdf() functions.
The second mutate() function replaces the percentiles of detection limit observations with 0 so that the scoring is identical (as possible) to our excel scorer
The third mutate() creates three columns that I used as a check to make sure the scorer was working correctly. This basically sums the NA values in each category. As we know, if a category has > 1 NA it should not be scored.
The fourth mutate() performs three different operations to determine category scores for Metals, Nutrients, and Development:
- incorporates the acute thresholds we set that 0 out the category score,
- prevents scoring if > 1 parameter in the category is NA, and
- calculates the inverse of the average of categories that do not meet the above two criteria, and puts the score on a 0 - 100 scale
The fifth mutate() calculates the overall SRB WQI score by averaging each category. If one or more categories are NA, no score is returned
The sixth mutate() classifies sites into Very Poor, Poor, Fair, Good, and Excellent categories as defined in the WQI report
The sevnth and last mutate() returns ‘CAUTION’ if the TSS (>80 mg/l) or flow (3.5 cfm) thresholds are exceeded

wqi_scores <- 
  wqi_data_full %>% 
  mutate( # calculate percentiles 
    #Metals
    Al_p = round(Fn_al(Aluminum_mg_l), 3),
    Fe_p = round(Fn_fe(Iron_mg_l), 3),
    Mn_p = round(Fn_mn(Manganese_mg_l), 3),
    
    #Nutrients
    NO3_p = round(Fn_no3(Nitrate_mg_l), 3),
    P_p = round(Fn_p(Phosphorus_mg_l), 3),
    TOC_p = round(Fn_toc(TOC_mg_l), 3),
    
    #Development
    Cl_p = round(Fn_cl(Chloride_mg_l), 3),
    Na_p = round(Fn_na(Sodium_mg_l), 3),
    SO4_p = round(Fn_so4(Sulfate_mg_l), 3)) %>% 
  
  mutate(# adjust percentiles to = 0
    #Metals
    Al_p = replace(Al_p, Al_p == 0.245, 0),
    Fe_p = replace(Fe_p, Fe_p == 0.009, 0),
    Mn_p = replace(Mn_p, Mn_p == 0.016, 0),
    
    #Nutrients
    P_p = replace(P_p, P_p == 0.113, 0),
    TOC_p = replace(TOC_p, TOC_p == 0.015, 0),
    
    #Development
    Cl_p = replace(Cl_p, Cl_p == 0.017, 0),
    SO4_p = replace(SO4_p, SO4_p == 0.001, 0)) %>% 

   mutate( #QC check to make sure categories are not scored if > 1 NA
     Metals_cat_NA_check = rowSums(is.na(cbind(Al_p, Fe_p, Mn_p))),
     Nutrients_cat_NA_check = rowSums(is.na(cbind(NO3_p, P_p, TOC_p))),
     Development_cat_NA_check = rowSums(is.na(cbind(Cl_p, Na_p, SO4_p)))
   ) %>% 

  
  mutate(#Calculate category scores and include case_when statements for acute thresholds
         # to 0 out scores
    Metals_cat = case_when(
      Aluminum_mg_l >= 0.75 ~ 0,
      Iron_mg_l >= 1.5 ~ 0,
      Manganese_mg_l >= 1 ~ 0,
      rowSums(is.na(cbind(Al_p, Fe_p, Mn_p))) > 1 ~ NA_real_,  # Don't score if > 1 NA
      TRUE ~ (round(1 - (rowMeans(cbind(Al_p, Fe_p, Mn_p), na.rm = T)), 3))*100),
    
    Nutrients_cat = case_when(
      Nitrate_mg_l >= 10 ~ 0,
      rowSums(is.na(cbind(NO3_p, P_p, TOC_p))) > 1 ~ NA_real_,  # Don't score if > 1 NA
      TRUE ~ (round(1 - (rowMeans(cbind(NO3_p, P_p, TOC_p), na.rm = T)), 3))*100),
      
    Development_cat = case_when(
      Chloride_mg_l >= 250 ~ 0,
      Sulfate_mg_l >= 250 ~ 0,
      rowSums(is.na(cbind(Cl_p, Na_p, SO4_p))) > 1 ~ NA_real_,  # Don't score if > 1 NA
      TRUE ~ (round(1 - (rowMeans(cbind(Cl_p, Na_p, SO4_p), na.rm = T)), 3))*100)
    ) %>% 
    
  
  mutate( #Calculate SRB WQI
    SRB_WQI = round(rowMeans(cbind(Metals_cat, Nutrients_cat, Development_cat)), 1) 
    #, na.rm = T - since not included, if one or more Categories are NA, SRB_WQI will = 'NaN'
  ) %>% 
    
  
  mutate( # Break SRB WQI scores into classes
    Classification = cut(SRB_WQI, breaks=c(-Inf, 31, 42.99, 62, 84.99, Inf), 
                         labels=c('Very Poor', 'Poor', 'Fair', 'Good', 'Excellent'))
  ) %>% 

  
  mutate( #Flow and TSS flags
    TSS_flag = ifelse(TSS_mg_l >= 80, 'CAUTION', ''),
    Flow_flag = ifelse(cfm >= 3.5, 'CAUTION', ''))

QC check of R WQI scoring compared to excel

After the code was written to calculate category and overall SRB WQI scores, it was time to QC the output by comparing it to the excel scoring output. The results of the QC check were good! All SRB WQI scores were <= 2.09 points different, most were extremely close. The two observations that were > 2.09 points off (3.5 and 5.9) were due to issues addressed below that were not a result of errors in the code.

The majority of the discrepencies were due to the fact that in the excel scorer, we could still generate a category score if > 1 parameters were NA. However, they were dealt with later on in the workflow and the overall SRB WQI score could not be generated. In this R scorer, the category scores were not calculated if > 1 parameters were NA. This issue was responsible for the few large discrepencies in category scores.

For instance, in the metals category in the above figure, there were 3 sites that had deltas of 17.1, 52.9, and 83. The larger of the two deltas were due to the fact that there were 2 NAs in the metals category for these sites. The excel scorer allowed the metals category to be scored, the R scorer did not. The site with a delta of 17.1 had a Mn concentration of 1.0 mg/L. This was our threshold for zeroing out the metals category. I coded the zero out as Mn >= 1 in R; in excel it was coded as > 1 mg/L. This was responsible for the delta and can be changed if necessary. The remainder of sites (120 total) had deltas <= 2.77.

In the nutrients category, there were 6 sites that had deltas > 5 (range 5.1 - 100). These were all cases where there were 2 NAs in the nutrient parameters and excel allowed a category score but R did not. The rest of the deltas were <= 2.83 points

The development category was extremely consistent, with all deltas <= 0.9.

The deltas in the overall SRB WQI score were <= 2.09 for all sites. The only exception was the site where Mn was 1.0 mg/L. The metals category got zeroed out in R (WQI = 16.7) but did not in excel (WQI = 22.6). Speaking generally, a difference of ~ 2 WQI points appears to be an acceptable level of variation, because the scoring classifications (e.g. good, fair, poor) will be unaffected unless the score is right on the boundary. This is especially true when considering the application for this code, where we will average multiple samples across multiple years to determine a site’s SRB WQI score.

Data requirements for future batch scoring

Site-level info, including but not-limited to: station name, alias id, project, date, time, ecoregion, drainage area, lat, long, etc…
Flag/caution info: TSS concentration and flow (can calculate cfm if DA and flow included above)
At a minimum, concentrations of Metals (Al, Fe, Mn), Nutrients (NO3, P, TOC), and Development (Cl, Na, SO4). Other parameters can be attached but will not factor into the WQI score

Output

The beauty of calculating WQI scores in R with this code is that it will keep the original dataframe intact with all of the site-level info and additional information, but will attach the following output:

Percentiles of individual parameters [9 columns]
NA check (sum of NAs for each category). This could be removed as its unessential once we’re satisfied with QC [3 columns]
Scores for each category (Metals, Nutrients, Development) and the SRB WQI (aggregate) score [4 columns]
Flags/caution for TSS and Flow [2 columns]

These 18 columns will be attached to the original dataframe, and can be exported to excel, used in GIS, mapped in R, summarized by site, etc…

Here is an example of what the output will look like (just a subset of columns)

unique_id_chrono	Al_p	Fe_p	Mn_p	NO3_p	P_p	TOC_p	Cl_p	Na_p	SO4_p	Metals_cat	Nutrients_cat	Development_cat	SRB_WQI	Classification	TSS_flag	Flow_flag
BALD_01_2	0.454	0.626	0.589	0.142	0.518	0.000	0.224	0.286	0.418	44.4	78.0	69.1	63.8	Good
BALD_01_1	NA	0.737	0.675	NA	NA	NA	0.346	0.465	0.587	29.4	NA	53.4	NA	NA	CAUTION	NA
BOWM_01_3	0.510	0.554	0.554	0.442	0.418	0.372	0.390	0.459	0.140	46.1	58.9	67.0	57.3	Fair	CAUTION
BOWM_01_2	0.427	0.409	0.202	0.317	0.382	0.260	0.244	0.294	0.185	65.4	68.0	75.9	69.8	Good
BOWM_01_1	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
BOWM_02_2	0.000	0.198	0.202	0.266	0.453	0.299	0.331	0.374	0.129	86.7	66.1	72.2	75.0	Good	CAUTION