Quality Assurance of WQN data using R

This is an attempt to QA WQN data to accomplish 4 steps in an automated fashion:

  1. Include flags for field chemistry – pH, DO, %DO, Temp, and SPC – that are out of range.
  2. Check if the sample has lab chemistry data included.
  3. Check if field chemistry was entered.
  4. If Reason Code = 002, then SAC021 should be present.

Necessary packages:

library(readxl)
library(janitor)
library(plyr)
library(lubridate)
library(writexl)
library(tidyverse)
library(tidylog) 

First we will import the WQN data that Molly provided from quarter 1 of water year 2019.

  • the read_excel() function imports the specified tab of the spreadsheet in the working directory as a tibble
  • the clean_names() function in the janitor package cleans up column names (e.g. replaces spaces with underscores, removes special characters, etc…)
  • the mutate() function with the as.numeric() function converts the final_amount column to a numeric (double) column
  • this is saved to a new object (tibble) called ‘wqn_df’
wqn_df <- read_excel('PANTN_WY19_Q1_data.xlsx', sheet = 'Sheet1') %>% 
  janitor::clean_names() %>% 
  mutate(
    final_amount = as.numeric(final_amount))

This is what the dataframe looks like after import. There are 6,524 rows and 19 columns.

collector wqn station_name usgs_gauge sampling_reason_code standard_analysis_or_suite date_collected time_collected sequence_number quality_assurance_type_desc test_code test_short_desc test_long_desc reading_indicator_code final_amount final_amount_units_abbrev quality_code standard_comment_desc free_form_comment
SRBC WQN0201 SUSQUEHANNA RVR 01576000 1 012 2018-10-09 08:30:00 08:30 287 NULL 00410 ALKALINITY ALKALINITY AS CaCO3 @ pH 4.5 NULL 49.200 mg/L NULL NULL NULL
SRBC WQN0201 SUSQUEHANNA RVR 01576000 1 012 2018-10-09 08:30:00 08:30 287 NULL 01105A ALUMINUM T ALUMINUM, TOTAL (WATER & WASTE) BY ICP NULL 1160.000 ug/L NULL Currently not certified for drinking water NULL
SRBC WQN0201 SUSQUEHANNA RVR 01576000 1 012 2018-10-09 08:30:00 08:30 287 NULL 00608A AMMONIA-N D AMMONIA DISSOLVED AS NITROGEN NULL 0.021 mg/L NULL NULL Dissolved result > 10% Relative Difference of total result.

Next we will convert the data from long to wide format

  • the first step will be to create a new object called ‘wqn_df_wide’
  • then use the unite() function (similar to concatenate in excel) to smash 6 columns together to create a new variable called ‘unique’
    • the contents of the columns are now separated with a ’_’
    • this will be the unique identifier variable used to cast the data to wide format, and will appear as unique rows
  • then use the unite() function again to create a unique ‘analyte’ variable, containing test code, test description, and units
  • we select the above two variables and the ‘final_amount’ [concentration] variable, dropping everything else
  • the spread() function casts the data to wide format, with unique id in rows and unique analytes in columns, filled with final_amount [concentration]
  • then use the separate() function to re-separate the unique identifier variable so we can look at each independently
    • specify that ’_’ is used as the separator
  • since the analyte field is now column headers, and this contains spaces and special characters, we need to use the clean_names() function again to address
  • now we add a series of columns using the mutate() function to begin the QA check process
    • calculate the number of field parameters included in each sample (field_num)
    • calculate the number of field parameters missing in each sample (field_na)
    • calculate the number of lab parameters included in each sample (lab_num)
    • calculate the number of lab parameters missing in each sample (lab_na)
    • a series of out of range (oor) columns to flag each time a field parameter is outside of a specified range
      • returns NA if field parameter is not outside of the range
      • range can be adjusted
    • sort output by collector, wqn and date
# reshape to wide
wqn_df_wide <-  
  wqn_df %>% 
  
  # create unique column for sample ID
  unite(unique, collector, wqn, sampling_reason_code, sequence_number, 
        quality_assurance_type_desc, date_collected) %>% 
  
  #create unique column for analyte
  unite(analyte, test_code, test_short_desc, final_amount_units_abbrev) %>% 
  
  #select columns to spread
  select(unique, analyte, final_amount) %>% 
  
  #make wide
  spread(analyte, final_amount) %>% 
  
  # separate unique identifier back into original columns
  separate(unique, c('collector', 'wqn', 'sampling_reason_code', 'sequence_number', 
                     'quality_assurance_type_desc', 'date_collected'), sep = "_") %>% 
  
  #clean col names
  janitor::clean_names() %>% 
  
  mutate(
    # number of NON NA field params
    field_num = rowSums(!is.na(cbind(
      f00061_stream_flow_cfs, f00094_specific_con_umhos_cm, f00300_oxygen_fld_mg_l,
      f00301_do_percent_field_percent, f00405_p_h_fld_p_h_units, f82078_turbidity_fi_ntu))),
    
    # number of NA field params
    field_na = rowSums(is.na(cbind(
      f00061_stream_flow_cfs, f00094_specific_con_umhos_cm, f00300_oxygen_fld_mg_l,
      f00301_do_percent_field_percent, f00405_p_h_fld_p_h_units, f82078_turbidity_fi_ntu))),
    
    # number of NON NA lab params
    lab_num = rowSums(!is.na(cbind(
      x00095_spc_25_0_c_umhos_cm, x00403_p_h_p_h_units, x00403t_p_h_temp_c,
      x00410_alkalinity_mg_l, x00530_t_susp_solid_mg_l, x00600a_nitrogentot_mg_l, 
      x00602a_dissolved_n_mg_l, x00608a_ammonia_n_d_mg_l, x00610a_ammonia_n_t_mg_l, 
      x00630a_no3_no2_n_mg_l, x00631a_no3_no2_n_d_mg_l, x00665a_phosphorus_t_mg_l, 
      x00666a_phosphorus_d_mg_l, x00671a_p_ortho_diss_mg_l, x00680_t_org_carbon_mg_l, 
      x00900_hardness_t_mg_l, x00916a_calcium_t_mg_l, x00927a_magnesium_t_mg_l, 
      x00929a_sodium_t_mg_l, x00937a_potassium_t_mg_l, x00940_chloride_ic_mg_l, 
      x00945_sulfate_ic_mg_l, x01007a_barium_t_ug_l, x01022k_boron_total_ug_l,
      x01042h_copper_t_ug_l, x01045a_iron_t_ug_l, x01051h_lead_t_ug_l, 
      x01055a_manganese_t_ug_l, x01067a_nickel_t_ug_l, x01082a_strontium_t_ug_l, 
      x01092a_zinc_t_ug_l, x01105a_aluminum_t_ug_l, x01132a_lithium_t_ug_l, 
      x01147h_selenium_t_ug_l, x70300u_tds180_usgs_mg_l, x70507a_phos_t_ortho_mg_l, 
      x82550_osmo_pres_null, x99020_bromide_ug_l))),
    
    # number of NA field params
    lab_na = rowSums(is.na(cbind(
      x00095_spc_25_0_c_umhos_cm, x00403_p_h_p_h_units, x00403t_p_h_temp_c,
      x00410_alkalinity_mg_l, x00530_t_susp_solid_mg_l, x00600a_nitrogentot_mg_l, 
      x00602a_dissolved_n_mg_l, x00608a_ammonia_n_d_mg_l, x00610a_ammonia_n_t_mg_l, 
      x00630a_no3_no2_n_mg_l, x00631a_no3_no2_n_d_mg_l, x00665a_phosphorus_t_mg_l, 
      x00666a_phosphorus_d_mg_l, x00671a_p_ortho_diss_mg_l, x00680_t_org_carbon_mg_l, 
      x00900_hardness_t_mg_l, x00916a_calcium_t_mg_l, x00927a_magnesium_t_mg_l, 
      x00929a_sodium_t_mg_l, x00937a_potassium_t_mg_l, x00940_chloride_ic_mg_l, 
      x00945_sulfate_ic_mg_l, x01007a_barium_t_ug_l, x01022k_boron_total_ug_l,
      x01042h_copper_t_ug_l, x01045a_iron_t_ug_l, x01051h_lead_t_ug_l, 
      x01055a_manganese_t_ug_l, x01067a_nickel_t_ug_l, x01082a_strontium_t_ug_l, 
      x01092a_zinc_t_ug_l, x01105a_aluminum_t_ug_l, x01132a_lithium_t_ug_l, 
      x01147h_selenium_t_ug_l, x70300u_tds180_usgs_mg_l, x70507a_phos_t_ortho_mg_l, 
      x82550_osmo_pres_null, x99020_bromide_ug_l))),
    
    #field chem out of range flags
    # SpC outside of 0-1000 range
    spc_oor = case_when(
      f00094_specific_con_umhos_cm < 0 ~ 'SpC_low',
      f00094_specific_con_umhos_cm > 1000 ~ 'SpC_high',
      TRUE ~ NA_character_),
    
    # DO outside of 4-15 range
    do_oor = case_when(
      f00300_oxygen_fld_mg_l < 4 ~'DO_low',
      f00300_oxygen_fld_mg_l > 15 ~ 'DO_high',
      TRUE ~ NA_character_),
    
    # pH outside of 4-10 range
        pH_oor = case_when(
      f00405_p_h_fld_p_h_units < 4 ~ 'pH_low',
      f00405_p_h_fld_p_h_units > 10 ~ 'pH_high',
      TRUE ~ NA_character_),
    
    # turb outside of 0-1000 range
    turb_oor = case_when(
      f82078_turbidity_fi_ntu < 0 ~ 'turb_low',
      f82078_turbidity_fi_ntu > 1000 ~ 'turb_high',
      TRUE ~ NA_character_)
    ) %>% 
  
  #sort
  arrange(collector, wqn, date_collected)

The resulting dataframe includes 180 unique sampling events and 59 columns. A subset of columns and rows are shown below, including the unique identifiers, a couple columns of lab/field parameters, and the columns we created for QA purposes.

collector wqn sampling_reason_code sequence_number quality_assurance_type_desc date_collected x00095_spc_25_0_c_umhos_cm f00405_p_h_fld_p_h_units f82078_turbidity_fi_ntu field_num field_na lab_num lab_na spc_oor do_oor pH_oor turb_oor
SRBC WQN0201 1 287 NULL 2018-10-09 08:30:00 184.2 7.54 4 2 32 6
SRBC WQN0201 1 847 NULL 2018-10-29 10:00:00 7.86 4 2 11 27
SRBC WQN0201 1 288 NULL 2018-11-08 08:30:00 164.0 7.50 4 2 32 6

Now that we have created the wide dataframe and QA columns, we can specifically address each QA criteria:

1. Include flags for field chemistry – pH, DO, %DO, Temp, and SPC – that are out of range (oor).
  • First create a new object called ‘field_oor’
  • Then select only the columns of interest
  • then use the filter() function to subset only the samples that contain at least one oor flag
field_oor <- 
  wqn_df_wide %>% 
  
  #select only columns of interest
  select(collector, wqn, sampling_reason_code, sequence_number, quality_assurance_type_desc, date_collected,
         f00010_water_temp_c, f00061_stream_flow_cfs, f00094_specific_con_umhos_cm, f00300_oxygen_fld_mg_l, 
         f00301_do_percent_field_percent, f00405_p_h_fld_p_h_units, f82078_turbidity_fi_ntu,
         spc_oor, do_oor, pH_oor, turb_oor) %>% 
  
  #remove samples with no field chem flags
  filter(!is.na(spc_oor) | !is.na(do_oor) | !is.na(pH_oor) | !is.na(turb_oor))

The 180 unique samples included in the original dataset contained only 2 records with at least one field chemistry measurement oor. It happened to be 2 occassions where DO > 15 mg/l. One was 15.2 (believable); one was 93.7 (sounds more like % saturation – suspect).

collector wqn sampling_reason_code sequence_number quality_assurance_type_desc date_collected f00010_water_temp_c f00061_stream_flow_cfs f00094_specific_con_umhos_cm f00300_oxygen_fld_mg_l f00301_do_percent_field_percent f00405_p_h_fld_p_h_units f82078_turbidity_fi_ntu spc_oor do_oor pH_oor turb_oor
SRBC WQN0281 1 146 NULL 2018-12-11 15:00:00 3.07 16.8 541 15.2 8.25 DO_high
SRBC WQN0301 15 806 NULL 2018-11-28 10:30:00 3.05 89600.0 142 93.7 7.43 DO_high
2. Check if the sample has lab chemistry data included.

The ‘wqn_df_wide’ object we created earlier contains a count of the number of lab parameters - and lab na’s - for each sample. However, if we just want to subset the samples that contained no lab parameters, we can created a new object called ‘lab_chem_check’, apply another filter(), and select only columns of interest.

lab_chem_check <- 
  wqn_df_wide %>% 
  
  #subset sites with no lab chem
  filter(lab_num == 0) %>% 
  
  #select only columns of interest
  select(collector, wqn, sampling_reason_code, sequence_number, 
         quality_assurance_type_desc, date_collected, lab_num, lab_na)

The output identifies 4 samples that have no lab chemistry - a possible QA issue to follow up on?

collector wqn sampling_reason_code sequence_number quality_assurance_type_desc date_collected lab_num lab_na
USGS WQN0269 1 607 NULL 2018-12-13 13:00:00 0 38
USGS WQN0278 1 606 NULL 2018-12-13 11:45:00 0 38
USGS WQN0280 1 605 NULL 2018-12-13 10:15:00 0 38
USGS WQN0317 14 158 NULL 2018-11-20 12:00:00 0 38
3. Check if field chemistry was entered.

The ‘wqn_df_wide’ object we created earlier also contains a count of the number of field parameters - and field na’s - for each sample. Since Molly mentioned that 4 parameters are usually required, here we subset the number of samples that have < 4 field chemistry parameters to a new object called ‘field_chem_check’.

field_chem_check <- 
  wqn_df_wide %>% 

  #subset sites with < 4 field chem entries
  filter(field_num < 4) %>% 
  
  #select only columns of interest
  select(collector, wqn, sampling_reason_code, sequence_number,
         quality_assurance_type_desc, date_collected, field_num, field_na)

The output identifies 24 samples that have no field chemistry.

collector wqn sampling_reason_code sequence_number quality_assurance_type_desc date_collected field_num field_na
SRBC WQN0204 15 267 Duplicate 2018-11-07 09:30:00 0 6
SRBC WQN0214 1 280 Blank 2018-11-14 09:30:00 0 6
SRBC WQN0214 1 281 Blank 2018-11-14 09:30:00 0 6

However, if the sample was a duplicate or blank, no field chemistry was required. So, lets take another step to identify how many samples should have had field chem. We’ll use the group_by() functon followed by a pipe and the summarise() function to determine how many blanks, dups, etc.. The result is 11 blanks and 9 dups that don’t need field chem, but 4 ‘NULL’ samples that should have field chem – a possible QA issue to follow up on?

quality_assurance_type_desc Unique_Elements
Blank 11
Duplicate 9
NULL 4

4. If Reason Code = 002, then SAC021 should be present.

This last chunk will use the original dataframe (wqn_df; long format) to search for cases when reason code = 2 and sac does not = 021. We can first use the unique() function to search for the unique elements of reason and sac codes. It’s apparent that this dataset doesn’t contain either, but the code that creates the ‘sac_conflict’ object will identify the issue if/when it occurs.

unique(wqn_df$sampling_reason_code) # "1"  "15" "14"
unique(wqn_df$standard_analysis_or_suite) # "012"  "NULL" "874"  "612"

sac_conflict <- 
  wqn_df %>% 
  
  filter(sampling_reason_code == 2 & standard_analysis_or_suite != '021')

Now that we have completed the 4 QA steps:

We can build a spreadsheet called ‘wqn_qa.xlsx’ containing all the output using the writexl package, write_xlsx() function to our working directory. We can include 4 different tabs with our output from the steps above.

write_xlsx(list(wqn_df_wide = wqn_df_wide, 
                field_oor = field_oor,
                field_chem_check = field_chem_check, 
                lab_chem_check = lab_chem_check,
                sac_conflict = sac_conflict), 
           "wqn_qa.xlsx")