The current MS Excel WQI calculator does the job, but is slow and labor intensive. The objective here is to execute the same scoring paradigm in R to speed up processing time. This document will cover:
  1. the coding necessary to complete batch WQI scoring in R,
  2. a QC check of R WQI scoring compared to excel, and
  3. the data requirements for future batch scoring.

First, I imported two dataframes to R. The first was the complete WQI dataset containing data from 2008-2017 (n=8118 observations) for all parameters that the WQI scoring is based on (wqi_df), i.e. the ‘All Data’ tab in the excel WQI calculator. Here is a view of the first couple columns.

StationID StationName AliasID Date Aluminum_mg_l Chloride_mg_l Iron_mg_l Nitrate_mg_l Phosphorus_mg_l Sodium_mg_l Sulfate_mg_l TOC_mg_l Manganese_mg_l
474 CHEM002.5-4176 CHEM 2.5 9/14/2010 0.200 69.0 0.082 0.66 0.102 44.8 21.0 2.83 0.022
1919 HLDN000.0-00001 Holden Ck US 5/1/2012 0.160 7.2 0.230 0.10 0.026 5.4 9.2 3.10 0.012
1920 HLDN000.0-00002 Holden Ck DS 5/1/2012 0.160 7.2 0.230 0.10 0.026 5.4 9.2 3.10 0.012
1931 COWNR00.0-00004 Cowanesque R US 5/31/2012 0.056 10.0 0.120 0.19 0.021 6.9 10.0 2.80 0.027
1932 COWNR00.0-00005 Cowanesque R DS 5/31/2012 0.056 10.0 0.120 0.19 0.021 6.9 10.0 2.80 0.027
466 CNST038.7-4277 CNST 38.7 6/19/2012 0.072 55.2 0.190 2.00 0.200 27.2 27.9 3.50 0.035

I then imported a second dataframe of sample data from the ARS Program Refinement effort (n= 120) that was scored previously using the Excel scorer (wqi_data_full). Note - the dataframe has the same 9 water quality parameters. The columns are named identically to wqi_df, but they don’t have to be. The cfm column indicates flow (cfs) per square mile. The TSS column is generated from random numbers. These two columns are used for the caution flags based on predetermined thresholds. Notice there are NA values, that will not present a problem.

Project unique_id_chrono Date TSS_mg_l cfm Aluminum_mg_l Chloride_mg_l Iron_mg_l Nitrate_mg_l Phosphorus_mg_l Sodium_mg_l Sulfate_mg_l TOC_mg_l Manganese_mg_l
ARS III BALD_01_2 2012-05-07 40.04289 0.9762082 0.095 4.20 0.220 0.11 0.029 3.100 9.5 0.5 0.035
ARS III BALD_01_1 2010-07-14 137.04372 NA NA 7.53 0.319 NA NA 5.778 13.0 NA 0.049
ARS III BOWM_01_3 2016-09-19 183.37515 0.9475834 0.110 8.90 0.180 0.39 0.023 5.600 6.2 1.8 0.031
ARS III BOWM_01_2 2012-06-06 56.87989 1.1008200 0.087 4.70 0.120 0.25 0.021 3.200 6.6 1.5 0.010
ARS III BOWM_01_1 2008-08-14 20.93003 0.2611056 NA NA NA NA NA NA NA NA NA
ARS III BOWM_02_2 2018-05-09 140.21149 1.5598954 0.050 7.00 0.063 0.21 0.025 4.300 6.1 1.6 0.010

Code necessary to complete batch WQI scoring in R

I then created Empirical Cumulative Distribution Functions using ecdf() (https://www.rdocumentation.org/packages/stats/versions/3.6.1/topics/ecdf) for each of the 9 parameters used in the SRB WQI. The example below shows the code required to create and ecdf function for aluminum.

Fn_al <- ecdf(wqi_df$Aluminum_mg_l) 
This is where things got a little wild. The ecdf function in R gives a percentile of a value, which is defined as ‘number of values less than or equal a certain value’. This complicated things because when entering the detection limit (0.05 mg/L) for Al, the percentile returned is NOT 0 as it is using the =PERCENTRANK.INC function in excel
Fn_al(0.05)
## [1] 0.2452598

Therefore, I had to manually adjust percentiles of detection limits to = 0 later on in the code. This was the biggest hang-up. After this step was completed, the code worked very similar to the scorer in excel.

Percentile distributions for each parameter

After the ecdf functions were completed for each of the 9 WQI parameters, I used the following code to calculate scores.

  • The first mutate() function calculates percentiles for each parameter using the previously discussed ecdf() functions.
  • The second mutate() function replaces the percentiles of detection limit observations with 0 so that the scoring is identical (as possible) to our excel scorer
  • The third mutate() creates three columns that I used as a check to make sure the scorer was working correctly. This basically sums the NA values in each category. As we know, if a category has > 1 NA it should not be scored.
  • The fourth mutate() performs three different operations to determine category scores for Metals, Nutrients, and Development:
    • incorporates the acute thresholds we set that 0 out the category score,
    • prevents scoring if > 1 parameter in the category is NA, and
    • calculates the inverse of the average of categories that do not meet the above two criteria, and puts the score on a 0 - 100 scale
  • The fifth mutate() calculates the overall SRB WQI score by averaging each category. If one or more categories are NA, no score is returned
  • The sixth mutate() classifies sites into Very Poor, Poor, Fair, Good, and Excellent categories as defined in the WQI report
  • The sevnth and last mutate() returns ‘CAUTION’ if the TSS (>80 mg/l) or flow (3.5 cfm) thresholds are exceeded
wqi_scores <- 
  wqi_data_full %>% 
  mutate( # calculate percentiles 
    #Metals
    Al_p = round(Fn_al(Aluminum_mg_l), 3),
    Fe_p = round(Fn_fe(Iron_mg_l), 3),
    Mn_p = round(Fn_mn(Manganese_mg_l), 3),
    
    #Nutrients
    NO3_p = round(Fn_no3(Nitrate_mg_l), 3),
    P_p = round(Fn_p(Phosphorus_mg_l), 3),
    TOC_p = round(Fn_toc(TOC_mg_l), 3),
    
    #Development
    Cl_p = round(Fn_cl(Chloride_mg_l), 3),
    Na_p = round(Fn_na(Sodium_mg_l), 3),
    SO4_p = round(Fn_so4(Sulfate_mg_l), 3)) %>% 
  
  mutate(# adjust percentiles to = 0
    #Metals
    Al_p = replace(Al_p, Al_p == 0.245, 0),
    Fe_p = replace(Fe_p, Fe_p == 0.009, 0),
    Mn_p = replace(Mn_p, Mn_p == 0.016, 0),
    
    #Nutrients
    P_p = replace(P_p, P_p == 0.113, 0),
    TOC_p = replace(TOC_p, TOC_p == 0.015, 0),
    
    #Development
    Cl_p = replace(Cl_p, Cl_p == 0.017, 0),
    SO4_p = replace(SO4_p, SO4_p == 0.001, 0)) %>% 

   mutate( #QC check to make sure categories are not scored if > 1 NA
     Metals_cat_NA_check = rowSums(is.na(cbind(Al_p, Fe_p, Mn_p))),
     Nutrients_cat_NA_check = rowSums(is.na(cbind(NO3_p, P_p, TOC_p))),
     Development_cat_NA_check = rowSums(is.na(cbind(Cl_p, Na_p, SO4_p)))
   ) %>% 

  
  mutate(#Calculate category scores and include case_when statements for acute thresholds
         # to 0 out scores
    Metals_cat = case_when(
      Aluminum_mg_l >= 0.75 ~ 0,
      Iron_mg_l >= 1.5 ~ 0,
      Manganese_mg_l >= 1 ~ 0,
      rowSums(is.na(cbind(Al_p, Fe_p, Mn_p))) > 1 ~ NA_real_,  # Don't score if > 1 NA
      TRUE ~ (round(1 - (rowMeans(cbind(Al_p, Fe_p, Mn_p), na.rm = T)), 3))*100),
    
    Nutrients_cat = case_when(
      Nitrate_mg_l >= 10 ~ 0,
      rowSums(is.na(cbind(NO3_p, P_p, TOC_p))) > 1 ~ NA_real_,  # Don't score if > 1 NA
      TRUE ~ (round(1 - (rowMeans(cbind(NO3_p, P_p, TOC_p), na.rm = T)), 3))*100),
      
    Development_cat = case_when(
      Chloride_mg_l >= 250 ~ 0,
      Sulfate_mg_l >= 250 ~ 0,
      rowSums(is.na(cbind(Cl_p, Na_p, SO4_p))) > 1 ~ NA_real_,  # Don't score if > 1 NA
      TRUE ~ (round(1 - (rowMeans(cbind(Cl_p, Na_p, SO4_p), na.rm = T)), 3))*100)
    ) %>% 
    
  
  mutate( #Calculate SRB WQI
    SRB_WQI = round(rowMeans(cbind(Metals_cat, Nutrients_cat, Development_cat)), 1) 
    #, na.rm = T - since not included, if one or more Categories are NA, SRB_WQI will = 'NaN'
  ) %>% 
    
  
  mutate( # Break SRB WQI scores into classes
    Classification = cut(SRB_WQI, breaks=c(-Inf, 31, 42.99, 62, 84.99, Inf), 
                         labels=c('Very Poor', 'Poor', 'Fair', 'Good', 'Excellent'))
  ) %>% 

  
  mutate( #Flow and TSS flags
    TSS_flag = ifelse(TSS_mg_l >= 80, 'CAUTION', ''),
    Flow_flag = ifelse(cfm >= 3.5, 'CAUTION', ''))

QC check of R WQI scoring compared to excel

After the code was written to calculate category and overall SRB WQI scores, it was time to QC the output by comparing it to the excel scoring output. The results of the QC check were good! All SRB WQI scores were <= 2.09 points different, most were extremely close. The two observations that were > 2.09 points off (3.5 and 5.9) were due to issues addressed below that were not a result of errors in the code.

The majority of the discrepencies were due to the fact that in the excel scorer, we could still generate a category score if > 1 parameters were NA. However, they were dealt with later on in the workflow and the overall SRB WQI score could not be generated. In this R scorer, the category scores were not calculated if > 1 parameters were NA. This issue was responsible for the few large discrepencies in category scores.

For instance, in the metals category in the above figure, there were 3 sites that had deltas of 17.1, 52.9, and 83. The larger of the two deltas were due to the fact that there were 2 NAs in the metals category for these sites. The excel scorer allowed the metals category to be scored, the R scorer did not. The site with a delta of 17.1 had a Mn concentration of 1.0 mg/L. This was our threshold for zeroing out the metals category. I coded the zero out as Mn >= 1 in R; in excel it was coded as > 1 mg/L. This was responsible for the delta and can be changed if necessary. The remainder of sites (120 total) had deltas <= 2.77.

In the nutrients category, there were 6 sites that had deltas > 5 (range 5.1 - 100). These were all cases where there were 2 NAs in the nutrient parameters and excel allowed a category score but R did not. The rest of the deltas were <= 2.83 points

The development category was extremely consistent, with all deltas <= 0.9.

The deltas in the overall SRB WQI score were <= 2.09 for all sites. The only exception was the site where Mn was 1.0 mg/L. The metals category got zeroed out in R (WQI = 16.7) but did not in excel (WQI = 22.6). Speaking generally, a difference of ~ 2 WQI points appears to be an acceptable level of variation, because the scoring classifications (e.g. good, fair, poor) will be unaffected unless the score is right on the boundary. This is especially true when considering the application for this code, where we will average multiple samples across multiple years to determine a site’s SRB WQI score.

Data requirements for future batch scoring

  1. Site-level info, including but not-limited to: station name, alias id, project, date, time, ecoregion, drainage area, lat, long, etc…

  2. Flag/caution info: TSS concentration and flow (can calculate cfm if DA and flow included above)

  3. At a minimum, concentrations of Metals (Al, Fe, Mn), Nutrients (NO3, P, TOC), and Development (Cl, Na, SO4). Other parameters can be attached but will not factor into the WQI score

Output

The beauty of calculating WQI scores in R with this code is that it will keep the original dataframe intact with all of the site-level info and additional information, but will attach the following output:

  1. Percentiles of individual parameters [9 columns]
  2. NA check (sum of NAs for each category). This could be removed as its unessential once we’re satisfied with QC [3 columns]
  3. Scores for each category (Metals, Nutrients, Development) and the SRB WQI (aggregate) score [4 columns]
  4. Flags/caution for TSS and Flow [2 columns]

These 18 columns will be attached to the original dataframe, and can be exported to excel, used in GIS, mapped in R, summarized by site, etc…

Here is an example of what the output will look like (just a subset of columns)

unique_id_chrono Al_p Fe_p Mn_p NO3_p P_p TOC_p Cl_p Na_p SO4_p Metals_cat Nutrients_cat Development_cat SRB_WQI Classification TSS_flag Flow_flag
BALD_01_2 0.454 0.626 0.589 0.142 0.518 0.000 0.224 0.286 0.418 44.4 78.0 69.1 63.8 Good
BALD_01_1 NA 0.737 0.675 NA NA NA 0.346 0.465 0.587 29.4 NA 53.4 NA NA CAUTION NA
BOWM_01_3 0.510 0.554 0.554 0.442 0.418 0.372 0.390 0.459 0.140 46.1 58.9 67.0 57.3 Fair CAUTION
BOWM_01_2 0.427 0.409 0.202 0.317 0.382 0.260 0.244 0.294 0.185 65.4 68.0 75.9 69.8 Good
BOWM_01_1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
BOWM_02_2 0.000 0.198 0.202 0.266 0.453 0.299 0.331 0.374 0.129 86.7 66.1 72.2 75.0 Good CAUTION