Code written by Sarah Smith using data collected and provided by the Lake Auburn Watershed Protection Commission and Holly Ewing. Data analysis were used to support the community-engaged research conducted by Sarah Smith, Lili Missbrenner, and Jonah Yaffe for ENVR417.
The water quality data analysis conducted in this research was broken up by sampling type, since the data were collected at in-lake sampling sites and perimeter sampling sites. First, I started with analyzing/visualizing the perimeter data. There were several different data sets within the 2010-2025 time chunk, so I had to read in the data separately and clean the data frames before joining all of the data together. This process also involved filtering out sampling sites with high sample abundance. Then, I was able to calculate means and visualize the data in different ways. Many of the graphs included in this code were not used in the report but helped shape the figures that were included. Additionally, I attempted to hide some of the outputs of certain code chunks that would make this document difficult to read.
First, Library all of the necessary packages.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.0 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
Warning: Values from `Result.Value` are not uniquely identified; output will contain
list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
{data} |>
dplyr::summarise(n = dplyr::n(), .by = c(Year, Day, Date, Time, Location,
Parameter)) |>
dplyr::filter(n > 1L)
Make a conductivity scatterplot to help visualize the data
PER_cond <-ggplot(data=PER_10_25_COND, aes(x=date_new, y=conductivity_u_s))+geom_point()+labs(x='Date of Sample', y='Conductivity (uS)', title='Conductivity of Perimeter Samples in Lake Auburn from 2010-2025')+theme_classic()PER_cond
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
mean_PER_cond_3 <-ggplot(data=avg_PER_COND_3, aes(x=location, y=meanCOND))+geom_jitter(data=PER_10_25_COND, size=.5, width=.4, aes(x=location, y=conductivity_u_s), col='orange',alpha= .5)+geom_point(col="darkblue")+geom_errorbar(data=avg_PER_COND_3, aes(x=location, ymin=meanCOND-se, ymax=meanCOND+se), width=.2, col="darkblue")+labs(x='Site', y='Mean Conductivity (uS/cm)', title='Mean Conductivity of Perimeter Samples in Lake Auburn from 2010-2025')+theme_bw()+theme(plot.title =element_text(hjust =0.5))mean_PER_cond_3
Warning: Removed 4 rows containing missing values or values outside the scale range
(`geom_point()`).
now make a boxplot for the data organized by site to help visualize where the outliers are
PER_cond_boxplot <-ggplot(data=PER_10_25_COND, aes(x=location, y=conductivity_u_s))+geom_boxplot()+labs(x='Sampling Site', y='Conductivity (uS)', title='Conductivity of Perimeter Sampling Sites in Lake Auburn from 2010-2025')+theme_classic()+theme(plot.title =element_text(hjust =0.5))PER_cond_boxplot
Warning: Removed 4 rows containing non-finite outside the scale range
(`stat_boxplot()`).
To help with visualizing conductivity and chloride loading trends across space and over time, make two new data frames, one with data from 2014-2019 and one with data from 2020-2025
make a new data frame that filters out any conductivity measurements above 800 as this is unlikely to be accurate
PER_10_25_COND_filtered$year <-as.factor(PER_10_25_COND_filtered$year)per_cond_14_19 <- PER_10_25_COND_filtered|>filter(year =="2014"| year =="2015"| year =="2016"| year =="2017"| year=='2018'| year =="2019")per_cond_20_25 <- PER_10_25_COND_filtered|>filter(year =="2020"| year =="2021"| year =="2022"| year =="2023"| year=='2024'| year =="2025")
now calculate mean conductvity by site for those chunks of years
make a graph with each of the year chunks plotted for each location
mean_per_cond14_19_20_25 <-ggplot(data=avg_per_cond_14_19, aes(x=location, y=meanCOND))+geom_point(size=2)+geom_errorbar(data=avg_per_cond_14_19, aes(x=location, ymin=meanCOND-se, ymax=meanCOND+se), width=.3)+geom_point(data=avg_per_cond_20_25, aes(x=location, y=meanCOND), size=2)+geom_errorbar(data=avg_per_cond_20_25, aes(x=location, ymin=meanCOND-se, ymax=meanCOND+se), width=.3)+labs(x='Site', y='Mean Conductivity (uS/cm)', title='Mean Conductivity of Perimeter Samples in Lake Auburn from 2014-2019 and 2020-2025')+theme_bw()+scale_color_bmj()+theme(plot.title =element_text(hjust =0.5))+theme(axis.title =element_text(size=18))mean_per_cond14_19_20_25
mutate a new column with the equation for the chloride conversion in the conductivity data frame for each time chunk, this will be in mEq so convert it to mg/L by multiplying by the atomic weight
now make 2 data frames, one with flow data from 2014-2019 and 2020-2025
discharge_14_25_NEW$year <-as.factor(discharge_14_25_NEW$year)discharge_14_19 <- discharge_14_25_NEW|>filter(year =="2014"| year =="2015"| year =="2016"| year =="2017"| year=='2018'| year =="2019")discharge_20_25 <- discharge_14_25_NEW|>filter(year =="2020"| year =="2021"| year =="2022"| year =="2023"| year=='2024'| year =="2025")
now select for only the sites we have avg conductivity for
avg_discharge_20_25$site <-as.factor(avg_discharge_20_25$site)avg_discharge_20_25_new <- avg_discharge_20_25|>filter(site =="13"| site =="16"| site =="17"| site =="18"| site =="2"| site =="23"| site =="25"| site =="26"| site =="27"| site =="3"| site =="4"| site =="5A"| site =="B-1"| site =="R-2"| site =="TBR")
avg_discharge_14_19$site <-as.factor(avg_discharge_14_19$site)avg_discharge_14_19_new <- avg_discharge_14_19|>filter(site =="13"| site =="16"| site =="17"| site =="18"| site =="2"| site =="23"| site =="25"| site =="26"| site =="27"| site =="3"| site =="4"| site =="5A"| site =="B-1"| site =="R-2"| site =="TBR")
now filter the median data sets the same way
med_discharge_20_25$site <-as.factor(med_discharge_20_25$site)med_discharge_20_25_new <- med_discharge_20_25|>filter(site =="13"| site =="16"| site =="17"| site =="18"| site =="2"| site =="23"| site =="25"| site =="26"| site =="27"| site =="3"| site =="4"| site =="5A"| site =="B-1"| site =="R-2"| site =="TBR")
med_discharge_14_19$site <-as.factor(med_discharge_14_19$site)med_discharge_14_19_new <- med_discharge_14_19|>filter(site =="13"| site =="16"| site =="17"| site =="18"| site =="2"| site =="23"| site =="25"| site =="26"| site =="27"| site =="3"| site =="4"| site =="5A"| site =="B-1"| site =="R-2"| site =="TBR")
now add the avg and median discharge in the next column by site, this places all of the calculations in the same data set the column names are important to be able to distinguish between the data and different units, etc.!
avg_load_14_19 <-merge(avg_per_cond_14_19_new, avg_discharge_14_19_new, by =c("site"="site"))
avg_load_20_25 <-merge(avg_per_cond_20_25, avg_discharge_20_25_new, by =c("site"="site"))
avg_load_14_19 <-merge(avg_load_14_19, med_discharge_14_19_new, by =c("site"="site"))
avg_load_20_25 <-merge(avg_load_20_25, med_discharge_20_25_new, by =c("site"="site"))
now add a column that multiplies chloride concentration by discharge to get load in mg/sec
Warning: Duplicated aesthetics after name standardisation: size
mean_per_cond25
make a new data frame with just turbidity (across the whole time span), drop any rows with NULL values and filter any turbidity measurements over 60 because this is almost impossible
make a turbidity scatterplot to visualize the data
PER_turb <-ggplot(data=PER_10_25_TURB, aes(x=date_new, y=turbidity_ntu))+#geom_point()+geom_smooth()+labs(x='Date of Sample', y='Turbidity (NTU)', title='Turbidity of Perimeter Samples in Lake Auburn from 2010-2025')+theme_classic()PER_turb
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
mean_PER_turb_filtered <-ggplot(data=avg_PER_TURB_filtered, aes(x=location, y=meanTURB))+geom_jitter(data=PER_10_25_TURB, size=.5, width=.4, aes(x=location, y=turbidity_ntu), alpha= .65, col='orange')+geom_point(col='darkblue', size=2)+geom_errorbar(data=avg_PER_TURB_filtered, aes(x= location, ymin=meanTURB-se, ymax=meanTURB+se), width=.5, col='darkblue')+labs(x='Site', y='Mean Turbidity (NTU)', title='Mean Turbidity of Perimeter Samples in Lake Auburn from 2010-2025')+theme_bw()+theme(plot.title =element_text(hjust =0.5))+theme(axis.title =element_text(size=18))mean_PER_turb_filtered
Next, analyze the in-lake data. These data also had to be combined across several different data sets in various formats. There was a large data set from 2010-2022, but the most recent years had to be added.
Warning: Values from `Result` are not uniquely identified; output will contain
list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
{data} |>
dplyr::summarise(n = dplyr::n(), .by = c(SAMPLE_DATE, SAMPLE_TIME, StationID,
SAMPLE_DEPTH, CORE, Comments, Flag, Type, Source, SAMPLE_LOCATION, Units,
Parameter)) |>
dplyr::filter(n > 1L)
Warning: Values from `Result` are not uniquely identified; output will contain
list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
{data} |>
dplyr::summarise(n = dplyr::n(), .by = c(Year, Day, Date, Time, Location,
Station_Number, Location_Name, Depth..m., Core.Depth..m., Units, RL, MDL,
Flag, Comments, Lab, Equipment.used, Serial.Number, Parameter)) |>
dplyr::filter(n > 1L)
in_lake_23_wide
in_lake_23_clean <-clean_names(in_lake_23_wide)|> dplyr::select(year, date, time, location, station_number, depth_m, conductivity, tds)
now try by the linear regression by date instead of year and try using the segmented package
lm_2 <-lm(conductivity ~ date2, data = in_lake_10_25newer)summary(lm_2)segs<-segmented(lm_2, seg.Z=~date2, npsi=2)summary(segs) #this provided me with an R squared value augseg<-augment(segs)
Warning: The `augment()` method for objects of class `segmented` is not maintained by the broom team, and is only supported through the `lm` tidier method. Please be cautious in interpreting and reporting broom output.
This warning is displayed once per session.
# do the davies test to see if the slopes differ from one another?davies.test(segs, k=10)#estimation of breakpoint locationsconfint(segs)intercept(segs)slope(segs)summary(segs)
attempt another linear regression using the new date column, however the segmented linear regression is likely more helpful to see the changes in trends over time
now look at the in-lake turbidity data. There were very few turbidity data points included in the in-lake data sets that we were given, so after trying to calculate mean turbidity by year and realizing that there was only a few years with turbidity data, this analysis mainly just focused on getting a turbidity graph for 2025.
start by reading in the turbidity data for 2025 (this was part of the 2025 Watershed Sampling Results that we were given)