Carson Lee Project Presentation

Odell Lee

2024-12-08

Link to presentation

Project Background

The Sammamish River flows north from the outlet of Lake Sammamish to Lake Washington, and is unusual as it flows between two lakes. The Sammamish River has been extensively modified since the early 1900s to reduce flooding impacts and accommodate various human activities such as navigation and floodplain agriculture. The urban and agricultural development, coupled with the loss of native vegetation, has deteriorated water quality along the river. This development has reduced the natural ability to regulate water quality especially with regards to temperature and Dissolved Oxygen (DO).

Sammamish River

Issue Description

This work will be used as part of a larger project that is creating a water quality model of the Sammamish River. For my data analysis project, I will examine water quality data from the river that will later be used in the model. More specifically, I will:

Use R to parse, organize, and analyze the data
Do some statistical analysis on the data
Look at replicant sample data to examine data reliability
From this data, look at selected parameters where there may be problems

HEC-RAS Model of Sammamish

Questions

This project will look at data which is publicly available. This data is water quality data collected during a 2015 study. The data mostly contains information from discreet samples collected during two synoptic periods over the summer.

Questions:

Using the field replicates collected during the synoptic periods, are there any parameters that fall outside of the recommended values? The target values for this is a Median Relative Standard Deviation (RSD) between 5% and 20%, and an average bias between 5% and 10% depending on the parameter.
Based on the initial assessment, examine the full data set of parameters where there could be quality control problems.

Statistical analysis of the data will be done to aid in its assessment and box plots will be used to visually assess data reliability.

Example table of data statistics

Data Source: Water Quality Parameters

The water quality data source is through Washington state’s Environmental Information Management (EIM) system. (https://apps.ecology.wa.gov/eim/search/Eim/EIMSearch.aspx?SearchType=AllEIM&State=newsearch&Section=study)

Search for study ID MROS0001
Over 2200 discrete measurements collected, mostly during two synoptic periods in the summer of 2015
Will do statistical analsysis on the field duplicates to examine data quality

EIM search page

Data Documentation

There is abundant information describing every aspect of the data available here http://ecyeim/eimhelp/

Data documentation, dictionary, and description

Description of the Data: EIM dataset

EIM discrete data has 92 columns!
We will import and use select() to whittle it down to 7
Data is standardized and therefore ideal for using R!

#Load tidyverse
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

#Now let's look at the EIM data
setwd("C:/Users/olee461/OneDrive - Washington State Executive Branch Agencies/Desktop/Courses/CSC360/For course/Class project/other data and scripts/chemical parameters")

#Step 2: Load your weather file 
EIM <- read.csv("EIMDiscreteResults_2023Dec18_2234.csv",header=T,sep=",")

#Note, it makes no sense to look at the structure and summary of the whole EIM data file, so we will select a few relevant paramters

rel_EIM <- EIM %>% 
  select(Location_Name, Field_Collection_Start_Date, Field_Collection_Comment, 
         Sample_ID, Result_Parameter_Name, Result_Value, Sample_Replicate_Flag)
str(rel_EIM)

## 'data.frame':    2234 obs. of  7 variables:
##  $ Location_Name              : chr  "BJW504 INSTREAM PIEZOMETER@ 08-SAMM-10.6" "BJW516 INSTREAM PIEZOMETER@ 08-SAMM-4.5" "08-SAMM-12.8" "08-SAMM-12.8" ...
##  $ Field_Collection_Start_Date: chr  "06/01/2015" "06/02/2015" "06/16/2015" "06/16/2015" ...
##  $ Field_Collection_Comment   : chr  "" "" "" "" ...
##  $ Sample_ID                  : chr  "" "" "" "" ...
##  $ Result_Parameter_Name      : chr  "Vertical Hydraulic Gradient" "Vertical Hydraulic Gradient" "Temperature, water" "Conductivity, Specific (at 25 deg C)" ...
##  $ Result_Value               : num  0.007 0.008 21.8 112.5 7.66 ...
##  $ Sample_Replicate_Flag      : chr  "N" "N" "" "" ...

summary(rel_EIM)

##  Location_Name      Field_Collection_Start_Date Field_Collection_Comment
##  Length:2234        Length:2234                 Length:2234             
##  Class :character   Class :character            Class :character        
##  Mode  :character   Mode  :character            Mode  :character        
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##   Sample_ID         Result_Parameter_Name  Result_Value     
##  Length:2234        Length:2234           Min.   :     0.0  
##  Class :character   Class :character      1st Qu.:     0.1  
##  Mode  :character   Mode  :character      Median :     3.2  
##                                           Mean   :  1943.7  
##                                           3rd Qu.:    16.4  
##                                           Max.   :414000.0  
##                                           NA's   :52        
##  Sample_Replicate_Flag
##  Length:2234          
##  Class :character     
##  Mode  :character     
##                       
##                       
##                       
##

Cleaning and Preparation: EIM replicant data

Now we will look at the EIM data First we will import it and select some relevant columns

#This script is meant to look at field replicant data and produce a summary table of statistics
#This will produce a summary of all the data, Sammamish only data, and tributary data

#First we need to clear the working environment, including all variables and data frames from the last chunk
rm(list = ls())

#Step 1: Set your working directory
setwd("C:/Users/olee461/OneDrive - Washington State Executive Branch Agencies/Desktop/Courses/CSC360/For course/Class project/other data and scripts/chemical parameters")

#Step 2: Load your weather file 
df <- read.csv("EIMDiscreteResults_2023Dec18_2234.csv",header=T,sep=",")


#Need to select relevant data.  This will give us an overview of the data
df1 <- df %>% 
  select(Location_Name,
         Field_Collection_Start_Date,
         Field_Collection_Comment, 
         Sample_ID, 
         Sample_Replicate_Flag, 
         Result_Parameter_Name, 
         Result_Value,
         Result_Value_Units,
         Result_Reporting_Limit,
         Result_Data_Qualifier_Description,
         Calculated_Latitude_Decimal_Degrees_NAD83HARN,
         Calculated_Longitude_Decimal_Degrees_NAD83HARN) %>% 
  mutate(Field_Collection_Start_Date = as.POSIXct(Field_Collection_Start_Date, format = "%m/%d/%Y")) %>%
  #mutate(Field_Collection_Start_Date_Time = as.POSIXct(Field_Collection_Start_Date_Time, format = "%m/%d/%Y %H:%M")) %>%
  #filter(Field_Collection_Start_Date_Time >= start_time & Field_Collection_Start_Date_Time <= end_time) #Comment this out if not filtering for dates
  arrange(Field_Collection_Start_Date)

Refine and add

Data frame showing Field Collection Comment

Adding columns with replicant and replicated sample ID numbers

#Will whittle down the selection and add two important columns with information from the Field_Collection_Comment parameter
#We will pull two numbers from this field.  The first being the replicant, the second being the replicated
#The first number is pulled after the characters "Sample" are read and before the characters "is"
#The second number is at the end of the field
df2 <- df1 %>% 
  select(Location_Name, Field_Collection_Start_Date, Field_Collection_Comment, 
         Sample_ID, Result_Parameter_Name, Result_Value, Sample_Replicate_Flag) %>% 
  mutate(replicant = str_extract(Field_Collection_Comment, "(?<=Sample )\\d+-\\d+(?= is)"),
         replicated =str_extract(Field_Collection_Comment, "\\d{7}-\\d{2}$")) %>% 
  arrange(Sample_ID)

#Remove all rows without sample IDs
df3 <- df2[df2$Sample_ID != "", ]

Showing replicant and replicated columns

Can further divide the data by main River or Tributary

head(df3$Location_Name)

## [1] "08-SAMMT-12.3" "08-SAMM-9.6"   "08-SAMMT-12.3" "08-SAMMT-12.3"
## [5] "08-SAMM-12.8"  "08-SAMM-4.5"

#One thing that could still be done is to split the values being the main Sammamish and tributaries
#All Sammamish data has 08_SAMM in the location name while tributary data has 08-SAMMT
#Now remove any data that is not on the main Sammamish
df3.1 <- df3[grep("^08-SAMM(?!T)", df3$Location_Name, perl = TRUE), ]
#For the tributaries use this
df3.2 <- df3[grep("^08-SAMMT", df3$Location_Name, perl = TRUE), ]

The tricky part!!!

We want to match the replicated sample IDs and parameters with the replicant

#Now we will match our replicant and replicated values per parameter!
#Although this looks simple, it took me forever to figure it out!
df4 <- df3 %>% 
  left_join(df3, by = c("replicated" = "Sample_ID", "Result_Parameter_Name" = "Result_Parameter_Name"), 
            suffix = c("", "replicated")) %>% 
  select(Sample_ID, Field_Collection_Comment, Result_Parameter_Name, Result_Value, replicated, Result_Valuereplicated)

#Remove all rows with missing or NA values.  Note: This data has been checked to make sure it is accurate and nothing is being left out
df5 <- df4 %>% 
  na.omit()

Dataframe with proper values

How has data changed

Observations and variables of different data frames

Statistics and data summary

#Now do some statistics for each row
df5$std_dev <- apply(df5[, c("Result_Value", "Result_Valuereplicated")], 1, sd)
df5$mean <- apply(df5[, c("Result_Value", "Result_Valuereplicated")], 1, mean)
df5$rel_std_dev <- (df5$std_dev/df5$mean)*100
df5$Absolut_error <- abs(df5$Result_Value - df5$Result_Valuereplicated)
df5$bias <- df5$Result_Value - df5$Result_Valuereplicated

#Get rid of all the scientific notation by selecting a large number
options(scipen = 999)

#Make a table summarizing all the results for each parameter
Summary_Table_All <- df5 %>% 
  group_by(Result_Parameter_Name) %>% 
  summarize(Number_Duplicates = length(Result_Value), Average_Value = mean(Result_Valuereplicated), 
            Max_Value = max(Result_Valuereplicated), Medain_RSD = median(rel_std_dev), 
            Mean_RSD = mean(rel_std_dev), Max_rsd = max(rel_std_dev),
            Average_Bias = mean(bias), Max_Bias = max(abs(bias)), 
            Percent_Avg_Bias = Average_Bias/Average_Value*100) %>% 
  mutate(across(where(is.numeric), ~ signif(.x, digits = 3)))  #Reduce numeric values to 3 significant digits


print(Summary_Table_All)

## # A tibble: 19 × 10
##    Result_Parameter_Name    Number_Duplicates Average_Value Max_Value Medain_RSD
##    <chr>                                <dbl>         <dbl>     <dbl>      <dbl>
##  1 Alkalinity, Total as Ca…                11       84.2      1.58e+2       0   
##  2 Ammonia                                 49        0.0304   4.46e-1       0   
##  3 Areal Biomass as Ash-Fr…                 2      407        5.93e+2      26   
##  4 Areal Biomass as Chloro…                 2        0.92     1.7 e+0      27.5 
##  5 Biochemical Oxygen Dema…                10        2        2   e+0       0   
##  6 Chloride                                10        4.68     5.62e+0       0   
##  7 Chlorophyll                             18    22700        2.83e+5       1.06
##  8 Dissolved Organic Carbon                10        2.96     3.7 e+0       0   
##  9 Nitrate + Nitrite as N                  47        0.374    1.85e+0       0   
## 10 Ortho-Phosphate                         44        0.0358   9.35e-2       0   
## 11 Percent Solids                           2        6.25     1.18e+1      26.5 
## 12 Phosphorus                               4     1740        3.66e+3       6.99
## 13 Total Non-Volatile Susp…                16        1.62     5   e+0       0   
## 14 Total Organic Carbon                    10        3.34     3.7 e+0       0   
## 15 Total Persulfate Nitrog…                47        0.575    1.46e+0       0   
## 16 Total Phosphorus                        47        0.0716   6.29e-1       0   
## 17 Total Suspended Solids                   9        2.11     5   e+0       0   
## 18 Turbidity                               10        1.73     2.6 e+0       0   
## 19 Volatile Organic Matter                  4    27100        6.42e+4      30.1 
## # ℹ 5 more variables: Mean_RSD <dbl>, Max_rsd <dbl>, Average_Bias <dbl>,
## #   Max_Bias <dbl>, Percent_Avg_Bias <dbl>

Final Results: Analyze the data

Summary table of replicant statistics

Will first look at the Percent average bias as a key indicator Some parameters seem to have problems, but there are very few samples collected for them, so we will exclude them.

Revised table

It looks like ammonia, chlorophyll, and Alkalinity should be investigated further.

Summary_table_revised <- Summary_Table_All %>% 
  filter(Number_Duplicates > 9)

Revised Summary Table

Further data examination

We will look at all the data for these 3 parameters and consider if the data was collected on the main river or a tributary

#First select the parameters of interest.  For us that is Alkalinity, and chlorophyll.  Since the scales are so different for each parameter, we will need to use a log scale
#First look at all of the data
df3.01 <- df3 %>% 
  filter(Result_Parameter_Name == "Alkalinity, Total as CaCO3" | Result_Parameter_Name == "Ammonia" | Result_Parameter_Name == "Chlorophyll") %>% 
   mutate(Result_Value = ifelse(Result_Parameter_Name == "Chlorophyll", Result_Value / 1000, Result_Value)) %>% 
  arrange(Location_Name)

ggplot(df3.01, aes(x = Result_Parameter_Name, y = Result_Value, fill = Result_Parameter_Name)) +
  geom_boxplot() +
    scale_y_log10(
    breaks = c(0.01, 0.1, 1, 10, 100, 1000, 10000, 100000, 1000000),  # Manually specify breaks (every order of magnitude)
    labels = c("0.01", "0.1", "1", "10", "100", "1000", "10000", "100000", "1000000")  # Custom labels, you can adjust them as needed
  ) +
  labs(title = "Boxplot of relevant parameters for river and tributaries",
       x = "Parameter", y = "Concentration, mg/L") +
  theme_classic()

Looking at only the Sammamish data

#Now look at the Sammamish river data without tributaries
df3.11 <- df3.1 %>% 
  filter(Result_Parameter_Name == "Alkalinity, Total as CaCO3" | Result_Parameter_Name == "Ammonia" | Result_Parameter_Name == "Chlorophyll") %>% 
    mutate(Result_Value = ifelse(Result_Parameter_Name == "Chlorophyll", Result_Value / 1000, Result_Value)) %>% 
  arrange(Location_Name)

ggplot(df3.11, aes(x = Result_Parameter_Name, y = Result_Value, fill = Result_Parameter_Name)) +
  geom_boxplot() +
   scale_y_log10(
    breaks = c(0.01, 0.1, 1, 10, 100, 1000, 10000, 100000, 1000000),  # Manually specify breaks (every order of magnitude)
    labels = c("0.01", "0.1", "1", "10", "100", "1000", "10000", "100000", "1000000")  # Custom labels, you can adjust them as needed
  ) +
  labs(title = "Boxplot of relevant parameters on the Sammamish River",
       x = "Parameter", y = "Concentration, mg/L") +
  theme_classic()

Now look at only the tributary data

#Now repeat for the tributaries
df3.21 <- df3.2 %>% 
  filter(Result_Parameter_Name == "Alkalinity, Total as CaCO3" | Result_Parameter_Name == "Ammonia" | Result_Parameter_Name == "Chlorophyll") %>% 
   mutate(Result_Value = ifelse(Result_Parameter_Name == "Chlorophyll", Result_Value / 1000, Result_Value)) %>% 
  arrange(Location_Name)

ggplot(df3.21, aes(x = Result_Parameter_Name, y = Result_Value, fill = Result_Parameter_Name)) +
  geom_boxplot() +
   scale_y_log10(
    breaks = c(0.01, 0.1, 1, 10, 100, 1000, 10000, 100000, 1000000),  # Manually specify breaks (every order of magnitude)
    labels = c("0.01", "0.1", "1", "10", "100", "1000", "10000", "100000", "1000000")  # Custom labels, you can adjust them as needed
  ) +
  labs(title = "Boxplot of relevant parameters on the tributaries",
       x = "Parameter", y = "Concentration, mg/L") +
  theme_classic()

Further analysis

Map of Agriculture District on Sammamish River

TPN loading on Sammamish and tributaries

Conclusions

By using R to parse, organize, and do some statistical analysis on our datasets, we have a much better ideal of:

The range and scope of our data
The quality of the data
How this data can be further used to give us meaningful answers for further modeling

Although the initial set up was time consuming, the code can be easily reused saving time in the long run!