Nora Kammer’s work revisited

Author

Curtis DeGasperi

Published

April 24, 2025

Introduction

Over 20 years ago, Nora Kammer (2004) reported on analyses of King County Small Lakes data to address three hypotheses:

  1. As the proportion of high-density land uses in lake watersheds increases, lakes will become more eutrophic.

  2. As the proportion of forested land uses in lake watersheds increases, lakes will become less eutrophic.

  3. Septic systems within lake watersheds will cause lakes to become more eutrophic.

Although not explicitly stated, Nora’s approach is often described as a space-for-time approach: she looked for correlations between independent variables (e.g., watershed %forest) and dependent variables (the trophic state indicators chlorophyll a, total phosphorus, and Secchi depth) at specific points in time.

The space-for-time approach uses spatial gradients in independent variables as a proxy for temporal changes in dependent variables. It is typically used when data have not been collected at individual locations over a period long enough to capture significant changes in the independent variables of interest.

Nora’s results were quite striking. High-density development in a lake’s watershed was negatively correlated with lake trophic state indicators (higher-density development ~ lower trophic state), and forest cover was positively correlated with trophic state indicators (higher forest cover ~ higher trophic state). A less counterintuitive finding was that lakes with sewered watersheds (i.e., few or no septic systems) tended to be less eutrophic.

Methods

I started by pulling the King Co Lake Stewardship data from our SQL database. I also downloaded National Lakes Assessment (NLA) data (here) as well as data from Western Washington University’s Institute for Watershed Studies (IWS) (here). The IWS samples a number of small lakes primarily in the northeast portion of the Puget Sound basin. I also requested and received lake data from Snohomish County.

I removed some lakes from the data set because they were rather large (Lake Sammamish) or highly altered or managed (e.g., alum treatments); these included Capitol, Ketchum, and Stevens. I also removed some redundant data, preferring Snohomish County data over data provided by NLA or IWS (e.g., Goodwin, Armstrong, and Sunday).

code to load and process the lake data
library(tidyverse)
library(lubridate)
library(readxl)
library(janitor)
library(RODBC)
library(sf)

library(conflicted)
conflict_prefer("filter", "dplyr", quiet = T)
conflict_prefer("select", "dplyr", quiet = T)

#########################################################
### Note that I haven't attempted to deal with nondetects
### nor have I assessed their frequency. I suspect nondetects
### for chlorophyll a, tp, and tn would be infrequent in lake
### surface waters during summer
### I also haven't tried to deal with secchi measurements
### that extend to the lake bottom. There are some in the 
### NLA dataset...perhaps some in the King Co and IWS data?
#########################################################
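#########################################################
### If nondetects prove frequent, one minimal (untested)
### option would be simple substitution at half the
### reporting limit -- e.g., assuming the qualifier column
### ChloroQ flags censored values and the limit is `mdl`:
### kc <- kc %>% mutate(ChlorophyllA =
###          ifelse(ChloroQ == "<", mdl/2, ChlorophyllA))
#########################################################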

#########################################################
### get King Co data
#########################################################
ch <- odbcConnect("Small lakes")
kc <- sqlQuery(ch, paste("SELECT Site.ID, Site.SiteName, Site.Locator, WQData.SiteID, WQData.CollectDate, WQData.Depth, WQData.DepthQ, WQData.ChlorophyllA, WQData.ChloroQ, 
                         WQData.TotalPhosphorus, WQData.TotalPQ, WQData.TotalNitrogen, WQData.TotalNQ, WQData.secchi, WQData.secchiQ, WQData.UV254, WQData.UV254Q  
                            FROM WQData INNER JOIN Site ON WQData.SiteID = Site.ID"))

odbcClose(ch)

kc <- kc %>% filter(!SiteName %in% c('Hicklin','North','Steel')) %>% transmute(lake = SiteName, Locator = Locator, type = "King Co", year = year(as.Date(CollectDate)), month = month(as.Date(CollectDate)), day = day(as.Date(CollectDate)), depth = Depth, chla = ChlorophyllA, tp = TotalPhosphorus, tn = TotalNitrogen, secchi = secchi, tn_tp = tn/tp) 

# add Lat/Lon - some of these were adjusted so they would occur on the lake and intersect with National Hydrography Data (NHD) waterbodies and so they would work with the StreamCat package that extracts the LakeCat data

kcsites <- read_csv('kcsites.csv') %>%  transmute(lake=Lake, Locator=SITE_ID, lat = LAT_DD83, lon = LON_DD83)
kc <- left_join(kc,kcsites,by=c("lake","Locator")) %>% select(-Locator) %>% arrange(lake)

#########################################################
### get National Lakes Assessment (NLA) data 
#########################################################
### add dummy month = 8 NLA lakes sampled July-September
### a bit tricky because they sometimes enter different lat/lon for each sampling event
nla <- read_csv('./data/nla2007-2022_data_forpopestimates_indexvisits_probsites_0.csv') %>% filter(PSTL_CODE=='WA') %>%  transmute(SITE_ID = SITE_ID, type = 'NLA', year = DSGN_CYCLE, month = 8, day = 15, depth = 1, chla = CHLA_RESULT, tn = NTL_RESULT, tp =PTL_RESULT, tn_tp = tn/tp, lat = LAT_DD83, lon= LON_DD83)

#### secchi data is in separate files for some reason
secchi07 <- read_csv('data/2007/NLA2007_secchi_20091008.csv') %>% select(SITE_ID,SECMEAN) %>% 
  inner_join(.,nla[c('SITE_ID')],by='SITE_ID') %>% rename(SECCHI = SECMEAN) %>% group_by(SITE_ID) %>% summarize(SECCHI = mean(SECCHI)) %>% ungroup() 
# secchi07 <- inner_join(secchi07,sites[c('SITE_ID','PSTL_CODE')],by='SITE_ID')
secchi12 <- read_csv('data/2012/nla2012_secchi_08232016.csv') %>% select(SITE_ID,SECCHI) %>% 
  inner_join(.,nla[c('SITE_ID')],by='SITE_ID') %>% group_by(SITE_ID) %>% summarize(SECCHI = mean(SECCHI)) %>% ungroup() 
secchi17 <- read_csv('data/2017/nla_2017_secchi-data.csv') %>% select(SITE_ID,CLEAR_TO_BOTTOM,INDEX_SITE_DEPTH,DISAPPEARS,REAPPEARS) %>% 
  inner_join(.,nla[c('SITE_ID')],by='SITE_ID') %>% mutate(SECCHI = (DISAPPEARS+REAPPEARS)/2) %>% select(-DISAPPEARS,-REAPPEARS)
secchi22 <- read_csv('data/2022/nla22_secchi.csv') %>% select(SITE_ID,CLEAR_TO_BOTTOM,INDEX_SITE_DEPTH,DISAPPEARS,REAPPEARS) %>% 
  inner_join(.,nla[c('SITE_ID')],by='SITE_ID') %>% mutate(SECCHI = (DISAPPEARS+REAPPEARS)/2) %>% select(-DISAPPEARS,-REAPPEARS) %>% 
  group_by(SITE_ID,CLEAR_TO_BOTTOM) %>% summarize(SECCHI = mean(SECCHI),INDEX_SITE_DEPTH = mean(INDEX_SITE_DEPTH)) %>% ungroup() 
secchi.nla <- bind_rows(list(secchi07,secchi12,secchi17,secchi22))

nla <- left_join(nla,secchi.nla,by="SITE_ID") 

# The NLA data files do not include lake names, so I first filtered the national data to lakes in the Puget Lowland and then identified lake names using Google Maps
nlasites <- read_csv('sites_lake_names.csv') %>% rename(lake = Lake)

nla <- left_join(nlasites,nla,by="SITE_ID") %>% rename(secchi = SECCHI) %>% select(-SITE_ID, -CLEAR_TO_BOTTOM, -INDEX_SITE_DEPTH)

# pull out lat/lon and rejoin unique set for each lake
nla_ll <- nla %>% select(lake,lat,lon) %>% distinct(lake,.keep_all = TRUE)
nla <- select(nla,-lat,-lon)
nla <- left_join(nla,nla_ll,by='lake')

# I know that one lat/lon (Sunwood Lake) needs to be fixed based on previous experience
nla$lon[39]<- -122.774407
nla$lat[39]<- 46.970290

### remove some lakes that are redundant with lakes sampled by Snohomish Co. and remove
### Capitol Lake since it is a unicorn...heavily modified, tidal, enormous watershed (all Deschutes River)
### and Sammamish since it isn't a "small" lake
nla <- filter(nla,!lake %in% c("Capitol Lake","Lake Goodwin","Lake Armstrong","Lake Sammamish"))

#########################################################
### get Western Washington University Institute for Watershed Studies (IWS) data 
### note: no secchi data :-(
#########################################################
iws <- read_csv('data/IWSwaterqualitydata-all.csv', na = "NA")
# I fixed some of these coordinates to work with StreamCat
iws_lldd <- read_csv('data/lake_coordinates_all_mod.csv') %>% rename(key = code) %>% select(-mod) # %>% rename(lake = Lake) 
iws <- left_join(iws,iws_lldd,by="key") 
### beware....the line below removes the less than detection symbol ("<")
iws[] <- lapply(iws, gsub, pattern='<', replacement='')
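# note: lapply() + gsub() also coerces every column to character, hence the as.numeric() conversion below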
# convert text values to numeric
iws <- iws %>% mutate_at(c(4:19,22:23), as.numeric) %>% select(lake,year,month,day,chl,tn,tp,lat,long) %>% rename(chla = chl, lon = long ) %>% mutate(type = 'IWS', depth = 1, tn_tp = tn/tp)

# I also know Hoag Pond does not have associated data in LakeCat
iws <- filter(iws,!lake %in% c('Hoag Pond'))

### remove lakes that are redundant with Snohomish Co. and Ketchum which has a treatment history
iws <- filter(iws,!lake %in% c("Armstrong","Sunday","Ketchum"))

#########################################################
### get Snohomish County data 
### provided separately for each parameter, so I will handle this a bit differently
### not calculating tn:tp yet
#########################################################
sno_locs <- read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/data/Snohomish County Lake Data-v2.xlsx",'SnoCo Locations') %>% clean_names() %>% 
  # mutate(lake_name = str_to_title(lake_name))
  mutate(lake = str_to_title(lake_name), lake = ifelse(lake == 'Little Martha','Little_martha',lake)) %>% select(lake, everything()) %>% select(-lake_name) %>% rename(lat = latitude, lon = longitude)

sno_chla <- read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/data/Snohomish County Lake Data-v2.xlsx",'Chla') %>% clean_names() %>% rename(lake_name = station_number) %>% 
  transmute(lake = str_to_title(lake_name), date = as.Date(date), year = year(date), month = month(date), day = day(date),
            depth = `depth_m`, layer = layer,
            Parm = 'chla', value = chla_result_ug_l, qualifier = qualifier_9, comment = qualifier_comment_10)

sno_tp <- read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/data/Snohomish County Lake Data-v2.xlsx",'TP') %>% clean_names() %>% rename(lake_name = station_number) %>% 
  transmute(lake = str_to_title(lake_name), date = as.Date(date), year = year(date), month = month(date), day = day(date),
            depth = `depth_m`, layer = layer,
            Parm = "tp", value = parameter_method_value_tp_sm4500pf*1000, qualifier = qualifier_9, comment = qualifier_comment_10)

sno_tn <- read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/data/Snohomish County Lake Data-v2.xlsx",'TN') %>% clean_names() %>% rename(lake_name = station_number) %>% 
  transmute(lake = str_to_title(lake_name), date = as.Date(date), year = year(date), month = month(date), day = day(date),
            depth = `depth_m`, layer = layer,
            Parm = "tn", value = tpn_result_mg_l*1000, qualifier = qualifier, comment = qualifier_comment)

sno_secchi <- read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/data/Snohomish County Lake Data-v2.xlsx",'Secchi') %>% clean_names()  %>% rename(lake_name = station) %>% 
  transmute(lake = str_to_title(lake_name), date = as.Date(date), year = year(date), month = month(date), day = day(date),
            depth = 1, layer = 'Epilimnion',
            Parm = "secchi", value = secchi_depth_m, qualifier = qualifier, comment = qualifier_comment)

sno_color <- read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/data/Snohomish County Lake Data-v2.xlsx",'TrueColor') %>% clean_names() %>% rename(lake_name = station_number) %>% 
  transmute(lake = str_to_title(lake_name), date = as.Date(date), year = year(date), month = month(date), day = day(date),
            depth = `depth_m`, layer = layer,
            Parm = "color", value = color_value_pcu, qualifier = qualifier, comment = qualifier_comment)

sno_wq <- bind_rows(sno_chla,sno_tp,sno_tn,sno_secchi,sno_color) %>% mutate(type = 'Snohomish')

sno_wq <- left_join(sno_wq, sno_locs, by = "lake") %>% filter(!is.na(lat)) %>% mutate(type = 'Snohomish')

### remove Martha_s
sno_wq <- filter(sno_wq,!lake=="Martha_s")
  
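### summarize Snohomish data: keep epilimnion samples (depth <= 2 m), compute daily means,
### then Jul-Sep 2020-2024 summer means with at least 2 sampling dates per year, keep lakes
### with more than 2 parameter-years of data, and average across years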
sno.js <- sno_wq %>% filter(layer == 'Epilimnion'&depth<=2) %>% group_by(lake,lat,lon,type,year,month,day,Parm) %>% 
  summarize(value = mean(value,na.rm=T), n = n()) %>% ungroup() %>%
  filter(between(month,7,9)&between(year,2020,2024)) %>% 
  group_by(lake,lat,lon,type,year,Parm) %>% summarize(value = mean(value,na.rm=T), n = n()) %>% ungroup() %>% 
  filter(n>=2) %>% 
  group_by(lake) %>% 
  mutate(ny = n()) %>% filter(ny>2) %>% 
  group_by(lake,lat,lon,type,Parm) %>% 
  summarize(value = mean(value,na.rm=T)) %>% ungroup() %>% 
  filter(!lake %in% c('Stevens','Ketchum','Beecher'))

# ggplot(sno_wq_long,aes(x=Parm,y=value)) +
#   geom_boxplot()
# 
#   tmp = filter(sno_wq_long,lake=='Armstrong'&Parm=='chla')
#   
#   # p <- ggplot(tmp, aes(date,value)) +
#   ggplot(tmp, aes(year,value)) +
#     theme_bw() +
#     geom_point() +
#     geom_smooth(se = FALSE) +
#     # expand_limits(y=0) +
#     ggtitle(tmp$lake[1])

##########################################################################
### now combine data and calculate summer (Jul-Sep) mean values
#
##########################################################################
df <- bind_rows(list(kc,nla,iws))
nms <- names(df)
df.long <- gather(df,-nms[c(1:6,12:13)],key=Parm,value=value)

# I'm going to play a bit loose with calculating means
# I'll use all of the NLA data (2007-2022)
# I'll use King Co data 2020-2024
# I'll use IWS data 2019-2023 with at least 4 years of data (No IWS data for 2024 at this time...but hopefully early next year?)

nla.js <- df.long %>% filter(type == 'NLA') %>% group_by(lake,type,lat,lon,Parm) %>% summarize(value = mean(value,na.rm=T)) %>% ungroup()
kc.js <- df.long %>% filter(type=="King Co"&between(year,2020,2024)&between(month,7,9)&depth==1) %>% group_by(lake,type,lat,lon,Parm) %>% summarize(value = mean(value,na.rm=T)) %>% ungroup()
iws.js <- df.long %>%  filter(type=='IWS'&between(year,2019,2023)&between(month,7,9)) %>% group_by(lake,type,lat,lon,Parm) %>% summarize(value = mean(value,na.rm=T), n = n()) %>% ungroup() %>% filter(n>3) %>% select(-n)

# df.long <- bind_rows(list(nla.js,kc.js,iws.js))
df.long <- bind_rows(list(nla.js,kc.js,iws.js,sno.js))

# now spread
df <- spread(df.long,key=Parm,value=value)

# now select only lakes with lat/lon within Puget Lowlands

### get PL ecoregion boundary....
ecoregion <- st_read("C:/Users/degaspec/OneDrive - King County/gis_external/EPA_Region10_ecoregions/reg10_eco_l3.shp",quiet = TRUE) %>%
  st_transform(4269)
# st_crs(ecoregion)
pugetl <- ecoregion %>% filter(US_L3NAME == "Puget Lowland") %>% dplyr::select(US_L3NAME) %>% 
  st_transform(4269)

### intersect puget lowland shape with sampling points
df <- st_as_sf(df, coords = c("lon", "lat"), remove = FALSE)
st_crs(df) <- 4269
df <- st_intersection(df,pugetl) %>% st_drop_geometry() %>% dplyr::select(-US_L3NAME)

df <- mutate(df, lake = str_remove_all(lake,"Lake | Lake")) %>% arrange(lake)
 tmp <- unique(df[c("lake","type")])

Currently, I have total phosphorus and chlorophyll a data for 115 and 114 lakes, respectively, across the Puget Lowland of the Puget Sound basin. I have Secchi depth data from 98 lakes because the IWS lake monitoring effort does not measure Secchi depth. To illustrate the coverage, I’ll provide a quick map of mean summer total phosphorus concentrations.
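
As a quick cross-check, these counts can be tallied from the wide data frame built above (a minimal sketch; df here is the object remaining after the Puget Lowland intersection):

code to count lakes per parameter
colSums(!is.na(df[c("chla", "tp", "secchi")]))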

code to map some lake data
library(mapview)

# mapview(df %>% filter(!lake=='Armstrong'), zcol = 'tp', xcol = "lon", ycol = "lat", crs = 4269, grid = FALSE, layer.name = 'TP Conc')  +
mapview(df, zcol = 'tp', xcol = "lon", ycol = "lat", crs = 4269, grid = FALSE, layer.name = 'TP Conc')  +
  mapview(pugetl, alpha.regions = 0.2)

Now I’ll pull in some LakeCat watershed data (here) and join it to the lake data. It turns out that StreamCatTools does not return any watershed data for Josephine Lake (NLA), although the COMID appears to be valid. Joining the lake water quality data with the land cover data leaves us with 114 lakes with total phosphorus/land cover data, 113 with chlorophyll a/land cover, and 97 with Secchi/land cover. Note that Pipe and Lucerne share the same COMID, so they have the same watershed data. I suppose I could choose to drop one of those lakes, but for now, enough of being a data monkey!

I’ve also since learned that lake-specific characteristics, particularly lake depth, explain much of the variance in lake trophic state indicators (Read et al. 2015). Therefore, I also added maximum lake depth to this data set as a potential explanatory variable.

code to load LakeCat data
library(StreamCatTools)
library(lakemorpho)
library(elevatr)

dd <- data.frame(x = df$lon,
                 y = df$lat) |> 
  sf::st_as_sf(coords = c('x', 'y'), crs = 4326)

# comids <- scan(textConnection(comids), sep=",")
# count.fields(textConnection(comids), sep = ",")

# for (i in 1:92){
#   print(i)
# comids <- lc_get_comid(dd[i,])
# }

comids <- lc_get_comid(dd)
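# lc_get_comid() returns the COMIDs as a single comma-separated string, so parse it into a numeric column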
comids2 <- data.frame(COMID = scan(textConnection(comids), sep=","))
df <- bind_cols(df,comids2)

#### run lakemorpho on all lakes using NHDWaterbody.shp
# ri_lakes <- st_read("./data/ri_lakes/ri_lakes.shp")
# lks <- st_read("C:/Users/degaspec/OneDrive - King County/gis_external/NHDPlusPN/NHDPlus17/NHDSnapshot/Hydrography/NHDWaterbody.shp") %>% 
#   # filter(COMID %in% c(24284092,24290072,23993571,24536772))  %>% 
#   st_transform(st_crs(ri_lakes))
# lkmorpho <- NULL
# # for (comid in df$COMID){
# for (i in 1:nrow(df)){
#   # lk <- filter(lks,COMID==comid)
#   lk <- filter(lks,COMID==df$COMID[i])
#   dem <- get_elev_raster(lk,z=12,expand=1000,src='aws')*3.281
#   lmorph <- lakeSurroundTopo(inLake = st_zm(lk),inElev = dem)
#   lmetric <- calcLakeMetrics(lmorph,bearing=270,pointDens = 100)
#   # dum <- data.frame(NAME = lk$GNIS_NAME,lmetric)
#   dum <- data.frame(lake = df$lake[i], type = df$type[i], lat = df$lat[i], lon = df$lon[i],
#                     lmetric)
#   lkmorpho <- bind_rows(lkmorpho,dum)
#   
# }
# rm(dum,dem,lmorph,lmetric,lk)
# 
# save(lkmorpho,file = 'lkmorpho.RData')
load('lkmorpho.RData')

#############################################################################################################
### select and import a bunch of metrics we can join to our King Co. and NLA lake water quality data and do 
### some statistical modeling of say summer chlorophyll a
### start with just watershed and not add 'catchment', 'other' [whatever that is], 'riparian catchment', or 
### 'riparian watershed' at least for now

# region_params <- lc_get_params(param='areaOfInterest')
# name_params <- lc_get_params(param='name')
# print(paste0('region parameters are: ', paste(region_params,collapse = ', ')))
#> [1] "region parameters are: catchment, other, riparian_catchment, riparian_watershed, watershed"
# print(paste0('A selection of available LakeCat metrics include: ',paste(name_params[1:10],collapse = ', ')))
#> [1] "A selection of available LakeCat metrics include: agkffact, al2o3, bfi, canaldens, cao, cbnf, clay, coalminedens, compstrgth, damdens"
#############################################################################################################
### I've already filtered the list down to this...
fnms <- read_csv('fullname_params_use.csv') %>% filter(use==1)

lc <- lc_get_data(metric=paste(fnms$name, collapse=","), aoi='watershed', comid=comids)
names(lc) <- toupper(names(lc))
# calculate %Developed and %Forest 
lc <- mutate(lc, PCTURB2019WS = PCTURBLO2019WS + PCTURBMD2019WS + PCTURBHI2019WS, PCTFOR2019WS = PCTCONIF2019WS + PCTMXFST2019WS + PCTDECID2019WS)

df.lc <- left_join(df,lc,by='COMID') %>% as.data.frame()

# there were 88 rows in df but only 85 in lc
# it turns out StreamCatTools does not return results for Josephine (NLA), returns the same data (same COMID) for Pipe and Lucerne lakes, and Lake Armstrong was sampled in NLA and by IWS. I'll remove Josephine here (the duplicate Armstrong rows were already dropped above)
# df.lc <- filter(df.lc, !lake %in% c('Josephine Lake','Armstrong'))
df.lc <- filter(df.lc, !lake %in% c('Josephine'))
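
# A quick (hypothetical) way to confirm which lakes dropped out or share a watershed:
# anti_join(df, lc, by = 'COMID') %>% distinct(lake, type)  # lakes with no LakeCat return
# count(df, COMID) %>% filter(n > 1)                        # shared COMIDs (Pipe/Lucerne)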

# sort(names(df.lc))

####################################################
### maybe this is where I add physical lake data to df?
### let's just start with maximum lake depth since we know 
### this is one of the most important explanatory variables
# iws.phys <- filter(df.lc,type=="IWS")
# iws.phys <- filter(df.lc,type=="IWS"&lake=="Honeymoon")
# 
# mapview(iws.phys, zcol = 'PCTFOR2019WS', xcol = "lon", ycol = "lat", crs = 4269, grid = FALSE, layer.name = '%forest')  +   mapview(pugetl, alpha.regions = 0.2)
# 
# nla.phys <- filter(df.lc,type=="NLA")
# nla.phys <- filter(df.lc,type=="NLA"&lake=="Lewis")
# 
# mapview(nla.phys, zcol = 'PCTFOR2019WS', xcol = "lon", ycol = "lat", crs = 4269, grid = FALSE, layer.name = '%forest')  +   mapview(pugetl, alpha.regions = 0.2)

iwsphys <- read.csv('data/iws_phys.csv') %>% select(lake,type,Depth_max_ft)
nlaphys <- read.csv('data/nla_phys.csv') %>% select(lake,type,Depth_max_ft)
snophys <- sno_locs %>% transmute(lake = lake, type = 'Snohomish', Depth_max_ft = max_depth_m*3.28084) 
kcphys <- read_csv('kcsites.csv') %>%  transmute(lake=Lake, type = "King Co", Depth_max_ft =  MaximumDepth)

lkphys <- bind_rows(iwsphys,nlaphys,snophys,kcphys)

df.lc <- left_join(df.lc,lkphys,by = c("lake","type"))

# tmp <- select(df.lc,lake,type,chla,PCTURB2019WS,Depth_max_ft)

So now I have lake and watershed data (and maximum lake depth) for up to 114 lakes across the Puget Lowland of the Puget Sound basin (King Co. = 42, NLA = 22, IWS = 16, Snohomish Co. = 34). To illustrate the coverage, I’ll provide a quick map of lake watershed %forest cover.

code to map some watershed data
# table(df.lc$type)

mapview(df.lc, zcol = 'PCTFOR2019WS', xcol = "lon", ycol = "lat", crs = 4269, grid = FALSE, layer.name = '%forest')  +   mapview(pugetl, alpha.regions = 0.2)

So now we can get on with some exploratory data analysis. I’m going to use boosted regression tree (BRT) modeling to sift through the very large number of potential explanatory variables from LakeCat (496 in total, though some of these are the same metric for different points in time; definitions here). I shortened the list to 44 metrics that seemed to have some potential to explain spatial variation in lake trophic state, from the most obvious, like %Developed and %Forest, to soil erodibility factors. I also log10 transformed the dependent variables (chlorophyll a, tp, tn, and secchi).

Let’s start with chlorophyll a. The initial BRT model narrowed the potentially important variables (relative importance > 6) down to fewer than a handful, and these definitively do not include %Developed or %Forest (see output below). Instead, the most influential variable by far was maximum lake depth. The next most influential variable was %Herbaceous wetland cover (PCTHBWET2019WS), followed by %Hay cover (PCTHAY2019WS).

The cross validation (CV) R2 of this model was 0.27. Not that great, but the model improved slightly when the explanatory variables were limited to the three with relative importance greater than 6 (following Waite et al. (2019)). That model had a CV R2 of 0.35.

Deeper lakes are associated with lower chlorophyll a, and higher watershed %Herbaceous wetland and %Hay cover are associated with higher lake chlorophyll a. Neither watershed %Forest nor %Developed appears to have a significant influence on summer lake chlorophyll a, which is more or less consistent with the findings of Kammer (2004).
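
For transparency, the relative importance screen can be applied programmatically. This is just a sketch, assuming gbm1 is the fitted gbm.step model from the block below:

code to list influential variables
ri <- summary(gbm1, plotit = FALSE)  # relative importance table (var, rel.inf)
ri$var[ri$rel.inf > 6]               # Depth_max_ft, PCTHBWET2019WS, PCTHAY2019WS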

code to perform exploratory data analysis for chlorophyll
### gbm + dismo - boosted regression tree packages
library(gbm)
library(dismo)

########################
### start with chlorophyll a
tmp <- filter(df.lc, !is.na(chla))
set.seed(2974)
gbm1 <- gbm.step(data = tmp %>% mutate(chla = log10(chla)), gbm.x = c(12:56), gbm.y=c(5), family="gaussian", tree.complexity = 5, learning.rate=0.001, bag.fraction=0.75, verbose = FALSE, silent = TRUE, plot.main = FALSE)

# https://stackoverflow.com/questions/60488587/boosted-regression-trees-deviance-values
# https://stats.stackexchange.com/questions/408119/which-method-is-correct-for-calculating-total-deviance-explained-for-boosted-reg
# mean(gbm1$residuals^2) # mean residual difference - here 0.08
# gbm1$cv.statistics$deviance.mean # here 0.138 estimated cv deviance
# gbm1$self.statistics$mean.null # here 0.149 mean total deviance
# 
# # # R2
 paste0('R2 = ', round((gbm1$self.statistics$mean.null - mean(gbm1$residuals^2) )/gbm1$self.statistics$mean.null,2))
[1] "R2 = 0.66"
code to perform exploratory data analysis for chlorophyll
# # 
# # # CV R2
 paste0('Cross validation (CV) R2 = ', round((gbm1$self.statistics$mean.null - gbm1$cv.statistics$deviance.mean)/gbm1$self.statistics$mean.null,2))
[1] "Cross validation (CV) R2 = 0.27"
code to perform exploratory data analysis for chlorophyll
# 
# # note how much lower the cross validation (CV) R2 is...this model would need a lot more work
#
# contributions of each predictor
# summary(gbm1,cBars = 10,las=2,cex.names=0.6, plotit = FALSE)
# summary(gbm1,cBars = 10,las=2,cex.names=0.6, plotit = TRUE)
 par(mar=c(5,8,4,0)+.1)
summary(gbm1,cBars = 10,las=2,cex.names=0.8, plotit = TRUE)

                                              var     rel.inf
Depth_max_ft                         Depth_max_ft 46.82414025
PCTHBWET2019WS                     PCTHBWET2019WS 11.12127557
PCTHAY2019WS                         PCTHAY2019WS  7.77721136
WETINDEXWS                             WETINDEXWS  2.10867798
OMWS                                         OMWS  1.89332130
ELEVWS                                     ELEVWS  1.83791991
P2O5WS                                     P2O5WS  1.81028390
NWS                                           NWS  1.73744727
PRECIP8110WS                         PRECIP8110WS  1.64088058
PCTURBOP2019WS                     PCTURBOP2019WS  1.61290965
PCTOW2019WS                           PCTOW2019WS  1.52590009
PRECIP2009WS                         PRECIP2009WS  1.40895915
HYDRLCONDWS                           HYDRLCONDWS  1.39707411
PCTWDWET2019WS                     PCTWDWET2019WS  1.37737630
PCTDECID2019WS                     PCTDECID2019WS  1.32627574
TMEAN8110WS                           TMEAN8110WS  1.19286485
SLOPEWS                                   SLOPEWS  1.06872737
TMIN8110WS                             TMIN8110WS  1.05851598
TMEAN2009WS                           TMEAN2009WS  0.97953021
RUNOFFWS                                 RUNOFFWS  0.94876373
PCTGRS2019WS                         PCTGRS2019WS  0.83667139
TMAX8110WS                             TMAX8110WS  0.77334207
PCTBL2019WS                           PCTBL2019WS  0.76324893
PCTGLACTILLOAMWS                 PCTGLACTILLOAMWS  0.71338929
PCTSHRB2019WS                       PCTSHRB2019WS  0.70681448
PCTURBHI2019WS                     PCTURBHI2019WS  0.62622794
PCTCONIF2019WS                     PCTCONIF2019WS  0.50308511
PCTURBLO2019WS                     PCTURBLO2019WS  0.49244997
HUDEN2010WS                           HUDEN2010WS  0.46416409
PCTNONAGINTRODMANAGVEGWS PCTNONAGINTRODMANAGVEGWS  0.44886756
BFIWS                                       BFIWS  0.41688312
PCTMXFST2019WS                     PCTMXFST2019WS  0.38314193
RDDENSWS                                 RDDENSWS  0.37344583
AGKFFACTWS                             AGKFFACTWS  0.34266722
RCKDEPWS                                 RCKDEPWS  0.29277874
POPDEN2010WS                         POPDEN2010WS  0.26903599
PCTFOR2019WS                         PCTFOR2019WS  0.26840880
PCTURBMD2019WS                     PCTURBMD2019WS  0.21984060
PERMWS                                     PERMWS  0.12998652
WTDEPWS                                   WTDEPWS  0.08939852
PCTIMP2019WS                         PCTIMP2019WS  0.07938937
FERTWS                                     FERTWS  0.07349156
PCTURB2019WS                         PCTURB2019WS  0.03762291
KFFACTWS                                 KFFACTWS  0.03291590
MANUREWS                                 MANUREWS  0.01467687

Improved model with only three variables…

code to perform improved BRT model for chlorophyll
set.seed(2974)
gbm1 <- gbm.step(data = tmp %>% mutate(chla = log10(chla)), gbm.x = c(28,29,56), gbm.y=c(5), family="gaussian", tree.complexity = 5, learning.rate=0.001, bag.fraction=0.75, verbose = FALSE, silent = TRUE, plot.main = FALSE)
# contributions of each predictor
# summary(gbm1,cBars = 10,las=2,cex.names=0.6, plotit = FALSE)
  par(mar=c(5,8,4,0)+.1)
 summary(gbm1,cBars = 10,las=2,cex.names=0.8, plotit = TRUE)

                          var  rel.inf
Depth_max_ft     Depth_max_ft 63.58864
PCTHBWET2019WS PCTHBWET2019WS 19.65762
PCTHAY2019WS     PCTHAY2019WS 16.75375
code to perform improved BRT model for chlorophyll
# # # R2
 paste0('R2 = ', round((gbm1$self.statistics$mean.null - mean(gbm1$residuals^2) )/gbm1$self.statistics$mean.null,2))
[1] "R2 = 0.6"
code to perform improved BRT model for chlorophyll
# # 
# # # CV R2
 paste0('Cross validation (CV) R2 = ', round((gbm1$self.statistics$mean.null - gbm1$cv.statistics$deviance.mean)/gbm1$self.statistics$mean.null,2))
[1] "Cross validation (CV) R2 = 0.35"
code to perform improved BRT model for chlorophyll
# 
#

So now we can just follow Nora’s approach and look at the correlations of chlorophyll a with the two most important watershed variables and with %Developed and %Forest. There is a statistically significant (p<0.05) positive relationship with watershed %Herbaceous wetland cover, but clearly no correlation with watershed %Forest or %Developed cover.

code to create scatter plots for chlorophyll a
suppressWarnings(library(patchwork))
suppressWarnings(library(plotly))
suppressWarnings(library(ggpubr))

cor = with(df.lc,cor.test(PCTHBWET2019WS,log10(chla)))
p1 <- ggplotly(ggplot(df.lc,aes(PCTHBWET2019WS,log10(chla),color=type,text=lake)) +
  theme_bw() +
  geom_point() +
    stat_cor(method='pearson') +
    theme(legend.position = 'none') +
      annotate('text',x = quantile(df.lc$PCTHBWET2019WS[!is.na(df.lc$PCTHBWET2019WS)],0.95), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,4),sep=" ")) +
  ggtitle('Chlorophyll a vs %Herbaceous Wetland and %Hay'))

# cor = with(df.lc,cor.test(PCTOW2019WS,log10(chla)))
# p2 <- ggplotly(ggplot(df.lc,aes(PCTOW2019WS,log10(chla),color=type,text=lake)) +
#   theme_bw() +
#     annotate('text',x = quantile(df.lc$PCTOW2019WS[!is.na(df.lc$PCTOW2019WS)],0.9), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,2),sep=" ")) +
#   geom_point())

cor = with(df.lc,cor.test(PCTHAY2019WS,log10(chla)))
p2 <- ggplotly(ggplot(df.lc,aes(PCTHAY2019WS,log10(chla),color=type,text=lake)) +
  theme_bw() +
    annotate('text',x = quantile(df.lc$PCTHAY2019WS[!is.na(df.lc$PCTHAY2019WS)],0.97), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,2),sep=" ")) +
  geom_point())

subplot(p1,p2,margin=0.1,shareY = T)
code to create scatter plots for chlorophyll a
# cor = with(df.lc,cor.test(PCTHAY2019WS,log10(chla)))
# p1 <- ggplotly(ggplot(df.lc,aes(PCTHAY2019WS,log10(chla),color=type,text=lake)) +
#   theme_bw() +
#   geom_point() +
#       annotate('text',x = quantile(df.lc$PCTHAY2019WS[!is.na(df.lc$PCTHAY2019WS)],0.9), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,2),sep=" ")) +    
#     theme(legend.position = 'none') +
#   ggtitle('Chlorophyll a vs %Hay and Hydrologic Conductivity'))
# cor = with(df.lc,cor.test(HYDRLCONDWS,log10(chla)))
# p2 <- ggplotly(ggplot(df.lc,aes(HYDRLCONDWS,log10(chla),color=type,text=lake)) +
#   theme_bw() +
#       annotate('text',x = quantile(df.lc$HYDRLCONDWS[!is.na(df.lc$HYDRLCONDWS)],0.9), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,2),sep=" ")) +    
#   geom_point())
#     
# subplot(p1,p2,margin=0.1,shareY = T)

cor = with(df.lc,cor.test(PCTURB2019WS,log10(chla)))
p1 <- ggplotly(ggplot(df.lc,aes(PCTURB2019WS,log10(chla),color=type,text=lake)) +
  theme_bw() +
  geom_point() +
    theme(legend.position = 'none') +
    annotate('text',x = quantile(df.lc$PCTURB2019WS[!is.na(df.lc$PCTURB2019WS)],0.75), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,2),sep=" ")) +
  ggtitle('Chlorophyll a vs %Developed and %Forest'))

cor = with(df.lc,cor.test(PCTFOR2019WS,log10(chla)))
p2 <- ggplotly(ggplot(df.lc,aes(PCTFOR2019WS,log10(chla),color=type,text=lake)) +
  theme_bw() +
        annotate('text',x = quantile(df.lc$PCTFOR2019WS[!is.na(df.lc$PCTFOR2019WS)],0.65), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,2),sep=" ")) +
  geom_point())
subplot(p1,p2,margin=0.1,shareY = T)

Now we can take a look at total phosphorus. The BRT model again identified maximum depth as the most important variable, with a negative relationship with lake total phosphorus. The next two variables, %Hay cover and %Soil organic matter (OMWS), were relatively unimportant. The CV R2 of this model was 0.48. Simplification of this model did not result in an improved CV R2.

The scatter plots indicate a statistically significant (p<0.05) negative relationship with watershed %Soil organic matter, but clearly no correlation with watershed %Forest or %Developed cover.

code to perform exploratory data analysis for total phosphorus
### gbm + dismo - boosted regression tree packages
library(gbm)
library(dismo)

########################
### total phosphorus
tmp <- filter(df.lc, !is.na(tp))
set.seed(2974)
gbm1 <- gbm.step(data = tmp %>% mutate(tp = log10(tp)), gbm.x = c(12:56), gbm.y=c(10), family="gaussian", tree.complexity = 5, learning.rate=0.001, bag.fraction=0.75, verbose = FALSE, silent = TRUE, plot.main = FALSE)

# https://stackoverflow.com/questions/60488587/boosted-regression-trees-deviance-values
# https://stats.stackexchange.com/questions/408119/which-method-is-correct-for-calculating-total-deviance-explained-for-boosted-reg
# mean(gbm1$residuals^2) # mean residual difference - here 0.08
# gbm1$cv.statistics$deviance.mean # here 0.138 estimated cv deviance
# gbm1$self.statistics$mean.null # here 0.149 mean total deviance
# 
# # # R2
 paste0('R2 = ', round((gbm1$self.statistics$mean.null - mean(gbm1$residuals^2) )/gbm1$self.statistics$mean.null,2))
[1] "R2 = 0.8"
code to perform exploratory data analysis for total phosphorus
# # 
# # # CV R2
 paste0('Cross validation (CV) R2 = ', round((gbm1$self.statistics$mean.null - gbm1$cv.statistics$deviance.mean)/gbm1$self.statistics$mean.null,2))
[1] "Cross validation (CV) R2 = 0.48"
code to perform exploratory data analysis for total phosphorus
# 
# # note how much lower the cross validation (CV) R2 is...this model would need a lot more work
#

# contributions of each predictor
# summary(gbm1,cBars = 10,las=2,cex.names=0.6, plotit = FALSE)
  par(mar=c(5,8,4,0)+.1)
 summary(gbm1,cBars = 10,las=2,cex.names=0.8, plotit = TRUE)

                                              var      rel.inf
Depth_max_ft                         Depth_max_ft 52.384966237
PCTHAY2019WS                         PCTHAY2019WS  4.750499940
OMWS                                         OMWS  4.163477580
PCTHBWET2019WS                     PCTHBWET2019WS  3.915978367
PRECIP8110WS                         PRECIP8110WS  3.855877144
HYDRLCONDWS                           HYDRLCONDWS  2.624971064
ELEVWS                                     ELEVWS  2.429877188
PCTWDWET2019WS                     PCTWDWET2019WS  2.384838477
TMEAN8110WS                           TMEAN8110WS  1.806353161
PCTURBMD2019WS                     PCTURBMD2019WS  1.707231305
PCTURBOP2019WS                     PCTURBOP2019WS  1.651723745
PCTDECID2019WS                     PCTDECID2019WS  1.430218085
TMIN8110WS                             TMIN8110WS  1.428920289
PRECIP2009WS                         PRECIP2009WS  1.309453432
RUNOFFWS                                 RUNOFFWS  1.040669254
P2O5WS                                     P2O5WS  0.968495528
WETINDEXWS                             WETINDEXWS  0.876070593
PCTGRS2019WS                         PCTGRS2019WS  0.816088222
PCTGLACTILLOAMWS                 PCTGLACTILLOAMWS  0.811939093
PCTOW2019WS                           PCTOW2019WS  0.780665098
PCTBL2019WS                           PCTBL2019WS  0.670762747
PCTCONIF2019WS                     PCTCONIF2019WS  0.630922904
TMAX8110WS                             TMAX8110WS  0.567499397
SLOPEWS                                   SLOPEWS  0.563257094
PCTSHRB2019WS                       PCTSHRB2019WS  0.559357443
PCTNONAGINTRODMANAGVEGWS PCTNONAGINTRODMANAGVEGWS  0.531674284
HUDEN2010WS                           HUDEN2010WS  0.525256441
POPDEN2010WS                         POPDEN2010WS  0.516772934
PCTURBLO2019WS                     PCTURBLO2019WS  0.465509395
TMEAN2009WS                           TMEAN2009WS  0.457283390
PCTMXFST2019WS                     PCTMXFST2019WS  0.452829011
RDDENSWS                                 RDDENSWS  0.414756713
BFIWS                                       BFIWS  0.355518793
NWS                                           NWS  0.355440948
PCTIMP2019WS                         PCTIMP2019WS  0.344331166
RCKDEPWS                                 RCKDEPWS  0.317378223
PCTFOR2019WS                         PCTFOR2019WS  0.308709036
PCTURBHI2019WS                     PCTURBHI2019WS  0.293712868
WTDEPWS                                   WTDEPWS  0.223388252
KFFACTWS                                 KFFACTWS  0.127887681
PCTURB2019WS                         PCTURB2019WS  0.093057117
PERMWS                                     PERMWS  0.053792046
AGKFFACTWS                             AGKFFACTWS  0.025424766
FERTWS                                     FERTWS  0.007163551
MANUREWS                                 MANUREWS  0.000000000
code to perform exploratory data analysis for total phosphorus
# library(scales)
# source('gbm.plot.kc.R')
# gbm.plot.kc(gbm1, n.plots=3, write.title = F,
#         plot.layout=c(3, 1),
#         common.scale = TRUE,
#         # smooth = TRUE,
#         show.contrib = TRUE)

# set.seed(2974)
# gbm1 <- gbm.step(data = df.lc %>% mutate(tp = log10(tp)), gbm.x = c(34,36,40,44), gbm.y=c(10),
# gbm1 <- gbm.step(data = df.lc %>% mutate(tp = log10(tp)), gbm.x = c(33,57),
# gbm.y=c(10),family="gaussian", tree.complexity = 5, learning.rate=0.001, bag.fraction=0.75, verbose = FALSE, silent = TRUE, plot.main = FALSE)
# 
# # # # R2
#  paste0('R2 = ', round((gbm1$self.statistics$mean.null - mean(gbm1$residuals^2) )/gbm1$self.statistics$mean.null,2))
# # # 
# # # # CV R2
#  paste0('Cross validation (CV) R2 = ', round((gbm1$self.statistics$mean.null - gbm1$cv.statistics$deviance.mean)/gbm1$self.statistics$mean.null,2))
# # 
# # # note how much lower the cross validation (CV) R2 is...this model would need a lot more work
# #
# # contributions of each predictor
# summary(gbm1,cBars = 10,las=2,cex.names=0.6, plotit = FALSE)

cor = with(df.lc,cor.test(PCTHAY2019WS,log10(tp)))
p1 <- ggplotly(ggplot(df.lc,aes(PCTHAY2019WS,log10(tp),color=type,text=lake)) +
  theme_bw() +
    annotate('text',x = quantile(df.lc$PCTHAY2019WS[!is.na(df.lc$PCTHAY2019WS)],0.97), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,2),sep=" ")) +
  geom_point()+
  ggtitle('Total Phosphorus vs %Hay and %Soil OM')) 

cor = with(df.lc,cor.test(OMWS,log10(tp)))
p2 <- ggplotly(ggplot(df.lc,aes(OMWS,log10(tp),color=type,text=lake)) +
  theme_bw() +
        annotate('text',x = quantile(df.lc$OMWS[!is.na(df.lc$OMWS)],0.9), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,4),sep=" ")) +
  geom_point())
# p1 <- ggplotly(ggplot(df.lc,aes(PCTHBWET2019WS,log10(tp),color=type,text=lake)) +
#   theme_bw() +
#   geom_point() +
#     stat_cor(method='pearson') +
#     theme(legend.position = 'none') +
#       annotate('text',x = quantile(df.lc$PCTHBWET2019WS[!is.na(df.lc$PCTHBWET2019WS)],0.95), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,4),sep=" ")) +
#   ggtitle('Total Phosphorus vs %Herbaceous Wetland and %Hay'))

subplot(p1,p2,margin=0.1,shareY = T)
code to perform exploratory data analysis for total phosphorus
cor = with(df.lc,cor.test(PCTURB2019WS,log10(tp)))
p1 <- ggplotly(ggplot(df.lc,aes(PCTURB2019WS,log10(tp),color=type,text=lake)) +
  theme_bw() +
  geom_point() +
    theme(legend.position = 'none') +
    annotate('text',x = quantile(df.lc$PCTURB2019WS[!is.na(df.lc$PCTURB2019WS)],0.75), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,2),sep=" ")) +
  ggtitle('Total Phosphorus vs %Developed and %Forest'))

cor = with(df.lc,cor.test(PCTFOR2019WS,log10(tp)))
p2 <- ggplotly(ggplot(df.lc,aes(PCTFOR2019WS,log10(tp),color=type,text=lake)) +
  theme_bw() +
        annotate('text',x = quantile(df.lc$PCTFOR2019WS[!is.na(df.lc$PCTFOR2019WS)],0.7), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,2),sep=" ")) +
  geom_point())
subplot(p1,p2,margin=0.1,shareY = T)

Now we can take a look at Secchi depth. The top explanatory variable was maximum lake depth. The next two variables with relative importance greater than 6 were the %Herbaceous and %Woody wetland cover metrics. The CV R2 of this model was 0.43. Simplification of this model did not result in an improved CV R2.

The scatter plots show that there is a statistically significant (p<0.05) negative relationship with watershed %Herbaceous and %Woody wetland cover, but clearly no correlation with watershed %Forest or %Developed cover.

code to perform exploratory data analysis for secchi depth
### gbm + dismo - boosted regression tree packages
library(gbm)
library(dismo)

########################
### secchi depth
tmp <- filter(df.lc, !is.na(secchi))
set.seed(2974)
gbm1 <- gbm.step(data = tmp %>% mutate(secchi = log10(secchi)), gbm.x = c(12:56), gbm.y=c(7), family="gaussian", tree.complexity = 5, learning.rate=0.001, bag.fraction=0.75, verbose = FALSE, silent = TRUE, plot.main = FALSE)

# https://stackoverflow.com/questions/60488587/boosted-regression-trees-deviance-values
# https://stats.stackexchange.com/questions/408119/which-method-is-correct-for-calculating-total-deviance-explained-for-boosted-reg
# mean(gbm1$residuals^2) # mean residual difference - here 0.08
# gbm1$cv.statistics$deviance.mean # here 0.138 estimated cv deviance
# gbm1$self.statistics$mean.null # here 0.149 mean total deviance

# # # R2
 paste0('R2 = ', round((gbm1$self.statistics$mean.null - mean(gbm1$residuals^2) )/gbm1$self.statistics$mean.null,2))
[1] "R2 = 0.79"
code to perform exploratory data analysis for secchi depth
# # 
# # # CV R2
 paste0('Cross validation (CV) R2 = ', round((gbm1$self.statistics$mean.null - gbm1$cv.statistics$deviance.mean)/gbm1$self.statistics$mean.null,2))
[1] "Cross validation (CV) R2 = 0.43"
code to perform exploratory data analysis for secchi depth
# 
# # note how much lower the cross validation (CV) R2 is...this model would need a lot more work
#

# contributions of each predictor
# summary(gbm1,cBars = 10,las=2,cex.names=0.6, plotit = FALSE)
  par(mar=c(5,8,4,0)+.1)
 summary(gbm1,cBars = 10,las=2,cex.names=0.8, plotit = TRUE)

                                              var     rel.inf
Depth_max_ft                         Depth_max_ft 43.17405030
PCTHBWET2019WS                     PCTHBWET2019WS 18.09981347
PCTWDWET2019WS                     PCTWDWET2019WS  6.63752059
PCTURBOP2019WS                     PCTURBOP2019WS  4.09792322
SLOPEWS                                   SLOPEWS  2.57065329
P2O5WS                                     P2O5WS  2.26636511
PCTOW2019WS                           PCTOW2019WS  1.90447651
NWS                                           NWS  1.45495766
PCTGRS2019WS                         PCTGRS2019WS  1.19424272
PCTNONAGINTRODMANAGVEGWS PCTNONAGINTRODMANAGVEGWS  1.08293148
PCTDECID2019WS                     PCTDECID2019WS  1.04722876
WETINDEXWS                             WETINDEXWS  1.03695804
PERMWS                                     PERMWS  0.98868892
HUDEN2010WS                           HUDEN2010WS  0.94346470
PRECIP2009WS                         PRECIP2009WS  0.93394591
TMAX8110WS                             TMAX8110WS  0.85821979
PCTHAY2019WS                         PCTHAY2019WS  0.80488442
OMWS                                         OMWS  0.75488555
TMEAN8110WS                           TMEAN8110WS  0.68109698
PCTGLACTILLOAMWS                 PCTGLACTILLOAMWS  0.66713875
TMEAN2009WS                           TMEAN2009WS  0.60614319
PRECIP8110WS                         PRECIP8110WS  0.58637853
PCTBL2019WS                           PCTBL2019WS  0.58613010
PCTCONIF2019WS                     PCTCONIF2019WS  0.56079669
POPDEN2010WS                         POPDEN2010WS  0.55097492
TMIN8110WS                             TMIN8110WS  0.52927396
HYDRLCONDWS                           HYDRLCONDWS  0.52302004
BFIWS                                       BFIWS  0.49125109
RCKDEPWS                                 RCKDEPWS  0.48209275
PCTURBHI2019WS                     PCTURBHI2019WS  0.47143270
PCTSHRB2019WS                       PCTSHRB2019WS  0.40615895
PCTURBLO2019WS                     PCTURBLO2019WS  0.38484991
RUNOFFWS                                 RUNOFFWS  0.37005194
ELEVWS                                     ELEVWS  0.35209097
PCTFOR2019WS                         PCTFOR2019WS  0.33247408
RDDENSWS                                 RDDENSWS  0.31478493
PCTMXFST2019WS                     PCTMXFST2019WS  0.29701495
KFFACTWS                                 KFFACTWS  0.24435081
PCTURBMD2019WS                     PCTURBMD2019WS  0.21340561
AGKFFACTWS                             AGKFFACTWS  0.20614671
PCTIMP2019WS                         PCTIMP2019WS  0.11529631
WTDEPWS                                   WTDEPWS  0.06895566
PCTURB2019WS                         PCTURB2019WS  0.05849430
FERTWS                                     FERTWS  0.04141030
MANUREWS                                 MANUREWS  0.00757444
code to perform exploratory data analysis for secchi depth
# 
 # set.seed(2974)
 # gbm1 <- gbm.step(data = df.lc %>% mutate(secchi = log10(secchi)) %>% filter(!is.na(secchi)), gbm.x = c(36,40,57), gbm.y=c(7), family="gaussian", tree.complexity = 5, learning.rate=0.001, bag.fraction=0.75, verbose = FALSE, silent = TRUE, plot.main = FALSE)


cor = with(df.lc,cor.test(PCTHBWET2019WS,log10(secchi)))
p1 <- ggplotly(ggplot(df.lc,aes(PCTHBWET2019WS,log10(secchi),color=type,text=lake)) +
  theme_bw() +
  geom_point() +
    stat_cor(method='pearson') +
    theme(legend.position = 'none') +
      annotate('text',x = quantile(df.lc$PCTHBWET2019WS[!is.na(df.lc$PCTHBWET2019WS)],0.95), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,7),sep=" ")) +
  ggtitle('secchi vs %Herbaceous and %Woody Wetland'))
cor = with(df.lc,cor.test(PCTWDWET2019WS,log10(secchi)))
p2 <- ggplotly(ggplot(df.lc,aes(PCTWDWET2019WS,log10(secchi),color=type,text=lake)) +
  theme_bw() +
    annotate('text',x = quantile(df.lc$PCTWDWET2019WS[!is.na(df.lc$PCTWDWET2019WS)],0.99), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
  geom_point())

subplot(p1,p2,margin=0.1,shareY = T)
code to perform exploratory data analysis for secchi depth
# cor = with(df.lc,cor.test(PCTOW2019WS,log10(secchi)))
# p1 <- ggplotly(ggplot(df.lc,aes(PCTOW2019WS,log10(secchi),color=type,text=lake)) +
#   theme_bw() +
#   geom_point() +
#       annotate('text',x = quantile(df.lc$PCTOW2019WS[!is.na(df.lc$PCTOW2019WS)],0.9), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,4),sep=" ")) +    
#     theme(legend.position = 'none') +
#   # ggtitle('secchi vs %Open Water and Baseflow Index'))
#     ggtitle('secchi vs %Open Water'))
# cor = with(df.lc,cor.test(BFIWS,log10(secchi)))
# p2 <- ggplotly(ggplot(df.lc,aes(BFIWS,log10(secchi),color=type,text=lake)) +
#   theme_bw() +
#       annotate('text',x = quantile(df.lc$BFIWS[!is.na(df.lc$BFIWS)],0.5), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,2),sep=" ")) +    
#   geom_point())
#     
# # subplot(p1,p2,margin=0.1,shareY = T)
# p1

cor = with(df.lc,cor.test(PCTURB2019WS,log10(secchi)))
p1 <- ggplotly(ggplot(df.lc,aes(PCTURB2019WS,log10(secchi),color=type,text=lake)) +
  theme_bw() +
  geom_point() +
    theme(legend.position = 'none') +
    annotate('text',x = quantile(df.lc$PCTURB2019WS[!is.na(df.lc$PCTURB2019WS)],0.75), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,2),sep=" ")) +
  ggtitle('secchi vs %Developed and %Forest'))

cor = with(df.lc,cor.test(PCTFOR2019WS,log10(secchi)))
p2 <- ggplotly(ggplot(df.lc,aes(PCTFOR2019WS,log10(secchi),color=type,text=lake)) +
  theme_bw() +
        annotate('text',x = quantile(df.lc$PCTFOR2019WS[!is.na(df.lc$PCTFOR2019WS)],0.75), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,2),sep=" ")) +
  geom_point())
subplot(p1,p2,margin=0.1,shareY = T)

Even the watershed wetland variables explain very little of the variance in lake trophic state across the Puget Lowland of Puget Sound. Is lake trophic state more a product of a legacy of change, from forest clearing through agricultural development to urbanization? Would sewer vs. septic explain more of the variance? Nora also found that hydraulic retention time was correlated with lake trophic state. Would hydraulic retention time explain as much of the variance as maximum lake depth, or more?

Some suggested reading:

Kammer (2004). Lake Water Quality and Health: Land Use Considerations for Small Lake Watersheds in King County, Washington. King County, WA.

Politi et al. (2024). A global typological approach to classify lakes based on their eutrophication risk. Aquatic Sciences 86:52.

Read et al. (2015). The importance of lake-specific characteristics for water quality across the continental United States. Ecological Applications 25(4):943-955.

Sabo et al. (2023). Comparing drivers of spatial variability in U.S. lake and stream phosphorus concentrations. J. Geophys. Res. Biogeosci. 128(8).

Addendum - Quick look at relationship between lake trophic state and septic system density

To do this, I used the latest septic system coverage for King County and looked at the correlation of lake watershed septic system density (OSSs per hectare: OSS_HA) with the lake trophic state indicators determined above. One of several possible limitations is that I do not currently have individual watershed delineations for every King County lake, and some lake watersheds extend into Snohomish County (Snohomish County provided septic system data!). I had to limit this analysis to King County lakes because septic data for the entire Puget Lowland of Puget Sound is unlikely to ever become available, although it might be possible in the future to do as Moore et al. (2003) did and contact each local jurisdiction to classify each lake watershed as sewered, served by septic systems, or undeveloped.

I also tried looking at the sum of the inverse distances of septic systems to the lake perimeter, but this didn’t seem to improve the relationships with lake trophic state.
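
For reference, a minimal sketch of that inverse-distance metric, assuming sf layers oss_points (septic system centroids) and lake_poly (a single lake polygon) in a projected CRS; both object names are hypothetical:

code to sketch the OSS inverse-distance metric
lake_perim <- sf::st_cast(lake_poly, "MULTILINESTRING")   # lake shoreline
d <- as.numeric(sf::st_distance(oss_points, lake_perim))  # OSS-to-shoreline distances
sum(1 / d)                                                # sum of inverse distances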

One next step would be to collect watershed delineations for Snohomish lakes and the missing King Co. lake delineations to increase the size of the dataset available for analysis.

Although there seems to be a hint of the expected relationship (increasing septic system density leading to a higher lake trophic state), that hypothesis is not strongly supported by the data. The strongest support appears for summer lake total nitrogen concentration, which is consistent with nitrate being the most mobile nutrient released from working septic systems.

Note: septic systems do not need to “fail” to release nutrients to local soils and groundwater. That is what working septic systems are designed to do: break down organic matter into water, carbon dioxide, and soluble nutrients. A “failing” septic system is one that releases untreated organic waste to soils or water (surface or ground).

code to perform exploratory data analysis for lake watershed septic density
### the operation to load the King Co septic data is too time consuming to include here so I will show the work but read the processed data
# oss_poly <- st_read("//gisdw/KCLIB/Plibrary2/utility/shapes/polygon/septic_onsite_parcel.shp")  
# oss_points <- st_centroid(oss_poly) %>% filter(ExpExclude == "OSS") %>%
#   st_transform(4269)
# 
# lake_ws <- st_read("C:/Users/degaspec/OneDrive - King County/gis_external/small_lakes/watersheds_all.shp")  %>%
#   st_transform(4269)
# 
# unique(oss_points$ExpExclude)
# 
# sewer_line <- st_read("//gisdw/KCLIB/Plibrary2/utility/shapes/arc/sewer.shp")  %>%
#   st_transform(4269)
# 
# sewer_serv <- st_read("//gisdw/KCLIB/Plibrary2/utility/shapes/polygon/sewer_serv.shp")  %>%
#   st_transform(4269)
# 
# lake_ws_oss <- st_filter(oss_points,lake_ws)
# lake_ws_oss.i <- st_intersection(oss_points,lake_ws)

# there is a different result depending on the use of filter or intersection. Choosing the intersection result for now.
# mapview(lake_ws_oss, size = 0.5) + mapview(lake_ws, alpha.regions = 0.2) + mapview(sewer, alpha.regions = 0.5, color = 'pink') +
#   mapview(sewer_serv, alpha.regions = 0.2, col.regions = 'pink')
# 
# mapview(lake_ws_oss.i, size = 0.5) + mapview(lake_ws, alpha.regions = 0.2) + mapview(sewer, alpha.regions = 0.5, color = 'pink') +
#   mapview(sewer_serv, alpha.regions = 0.2, col.regions = 'pink')

# lake_ws_oss_dens <- st_drop_geometry(lake_ws_oss.i) %>% data.frame()
# lake_ws_oss_dens <- lake_ws_oss_dens %>% group_by(SOURCETHM) %>% summarize(OSS_HA = n()/HECTARES, n = n(), HECTARES = HECTARES) %>% ungroup() %>% distinct()
# unique(lake_ws_oss.i$SOURCETHM)
# 
# lake_ws_oss_dens <- mutate(lake_ws_oss_dens,lake = str_replace(SOURCETHM,"_s.*",""))
# 
# saveRDS(lake_ws_oss_dens,'lake_ws_oss_dens.rds')
lake_ws_oss_dens <- readRDS('lake_ws_oss_dens.rds')

kc.lakes <- dplyr::filter(df,type=='King Co')

lake_ws_oss_dens <- mutate(lake_ws_oss_dens, lake = ifelse(lake == "Echo","Echo-Shoreline",
                                           ifelse(lake == 'Beaver1',"Beaver-1",
                                                  ifelse(lake == 'Beaver2','Beaver-2',
                                                         ifelse(lake == 'Neilson', "Neilson (Holm)",
                                                                ifelse(lake == 'Mcdonald','McDonald',
                                           lake))))))

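# Green-1 presumably had no OSS points in its watershed, so add an explicit zero-density row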
tmp <- data.frame(OSS_HA = 0, n = 0, lake = "Green-1" )
lake_ws_oss_dens <- bind_rows(lake_ws_oss_dens,tmp)

# kcsites <- read_csv('kcsites.csv') %>%  transmute(lake=Lake, Locator=SITE_ID, lat = LAT_DD83, lon = LON_DD83)
kcsites <- read_csv('kcsites.csv') %>%  transmute(lake=Lake, Locator=SITE_ID, depth_max_ft = depth_max_ft)
# kcsites$Sewer_status <-  fct_collapse(kcsites$Sewer_status, Sewered = c("Sewered","Fringe"))
tmp2 <- left_join(kc.lakes,kcsites,by="lake")

# tmp2 <- left_join(kc.lakes,lake_ws_oss_dens,by='lake')
tmp2 <- left_join(tmp2,lake_ws_oss_dens,by='lake')

cor = with(tmp2,cor.test(OSS_HA,log10(chla)))
p1 <- ggplot(tmp2,aes(OSS_HA,log10(chla))) +
           geom_point() +
           # geom_quantile(formula = y~x, quantiles = 0.9) +
           geom_smooth(formula = y~x, method = 'lm') +
           theme(legend.position = 'none') +
           annotate('text',x = quantile(tmp2$OSS_HA,0.9,na.rm=T), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
           ggtitle('Chlorophyll a vs OSS density (#/ha)') # )

cor = with(tmp2,cor.test(OSS_HA,log10(tp)))
p2 <- ggplot(tmp2,aes(OSS_HA,log10(tp))) +
           geom_point() +
           # geom_quantile(formula = y~x, quantiles = 0.9) +
           geom_smooth(formula = y~x, method = 'lm') +
           theme(legend.position = 'none') +
           annotate('text',x = quantile(tmp2$OSS_HA,0.9,na.rm=T), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
           ggtitle('Total Phosphorus vs OSS density (#/ha)') #)

cor = with(tmp2,cor.test(OSS_HA,log10(tn)))
p3 <- ggplot(tmp2,aes(OSS_HA,log10(tn))) +
           geom_point() +
           # geom_quantile(formula = y~x, quantiles = 0.9) +
           geom_smooth(formula = y~x, method = 'lm') +
           theme(legend.position = 'none') +
           annotate('text',x = quantile(tmp2$OSS_HA,0.9,na.rm=T), y = 3.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
           ggtitle('Total Nitrogen vs OSS density (#/ha)') #)

cor = with(tmp2,cor.test(OSS_HA,log10(secchi)))
p4 <- ggplot(tmp2,aes(OSS_HA,log10(secchi))) +
                 geom_point() +
                 # geom_quantile(formula = y~x, quantiles = 0.9) +
                 geom_smooth(formula = y~x, method = 'lm') +
                 theme(legend.position = 'none') +
                 annotate('text',x = quantile(tmp2$OSS_HA,0.9,na.rm=T), y = 1.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
                 ggtitle('Secchi vs OSS density (#/ha)')

(p1+p2)/(p3+p4) # 2x2 panel layout (patchwork)

code to perform exploratory data analysis for lake watershed septic density
# load('lake_oss_wd.RData')
# lk_oss_wd$lake = lk_oss_wd$ws_name
# lk_oss_wd %>% mutate(lk_oss_wd, lake = ifelse(lake=="Echo","Echo-Shoreline",ifelse(lake=="Beaver1","Beaver-1",ifelse(lake=="Beaver2","Beaver-2",lake)))) %>% st_drop_geometry()
# 
# tmp3 <- left_join(kc.lakes,lk_oss_wd,by="lake") %>% filter(!is.na(invdist))
# 
# # invdist
# 
# cor = with(tmp3,cor.test(invdist,log10(chla)))
# p1 <- ggplot(tmp3,aes(invdist,log(chla))) +
#            geom_point() +
#            # geom_quantile(formula = y~x, quantiles = 0.9) +
#            geom_smooth(formula = y~x, method = 'lm') +
#            theme(legend.position = 'none') +
#            annotate('text',x = quantile(tmp3$invdist,0.9,na.rm=T), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
#            ggtitle('Chlorophyll a vs OSS inverse distance (ft)') # )
# 
# cor = with(tmp3,cor.test(invdist,log10(tp)))
# p2 <- ggplot(tmp3,aes(invdist,log(tp))) +
#            geom_point() +
#            # geom_quantile(formula = y~x, quantiles = 0.9) +
#            geom_smooth(formula = y~x, method = 'lm') +
#            theme(legend.position = 'none') +
#            annotate('text',x = quantile(tmp3$invdist,0.9,na.rm=T), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
#            ggtitle('Total Phosphorus vs OSS inverse distance (ft)') #)
# 
# cor = with(tmp3,cor.test(invdist,log10(tn)))
# p3 <- ggplot(tmp3,aes(invdist,log(tn))) +
#            geom_point() +
#            # geom_quantile(formula = y~x, quantiles = 0.9) +
#            geom_smooth(formula = y~x, method = 'lm') +
#            theme(legend.position = 'none') +
#            annotate('text',x = quantile(tmp3$invdist,0.9,na.rm=T), y = 6.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
#            ggtitle('Total Nitrogen vs OSS inverse distance (ft)') #)
# 
# cor = with(tmp3,cor.test(invdist,log10(secchi)))
# p4 <- ggplot(tmp3,aes(invdist,log(secchi))) +
#                  geom_point() +
#                  # geom_quantile(formula = y~x, quantiles = 0.9) +
#                  geom_smooth(formula = y~x, method = 'lm') +
#                  theme(legend.position = 'none') +
#                  annotate('text',x = quantile(tmp3$invdist,0.9,na.rm=T), y = 1.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
#                  ggtitle('secchi vs OSS inverse distance (ft)') #)
# 
# (p1+p2)/(p3+p4)
# 
# cor = with(tmp3,cor.test(log10(invdist),log10(chla)))
# p1 <- ggplot(tmp3,aes(log10(invdist),log(chla))) +
#            geom_point() +
#            # geom_quantile(formula = y~x, quantiles = 0.9) +
#            geom_smooth(formula = y~x, method = 'lm') +
#            theme(legend.position = 'none') +
#            annotate('text',x = quantile(log10(tmp3$invdist),0.9,na.rm=T), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
#            ggtitle('Chlorophyll a vs OSS inverse distance (ft)') # )
# 
# # log10(invdist)
# 
# cor = with(tmp3,cor.test(log10(invdist),log10(tp)))
# p2 <- ggplot(tmp3,aes(log10(invdist),log(tp))) +
#            geom_point() +
#            # geom_quantile(formula = y~x, quantiles = 0.9) +
#            geom_smooth(formula = y~x, method = 'lm') +
#            theme(legend.position = 'none') +
#            annotate('text',x = quantile(log10(tmp3$invdist),0.9,na.rm=T), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
#            ggtitle('Total Phosphorus vs OSS inverse distance (ft)') #)
# 
# cor = with(tmp3,cor.test(log10(invdist),log10(tn)))
# p3 <- ggplot(tmp3,aes(log10(invdist),log(tn))) +
#            geom_point() +
#            # geom_quantile(formula = y~x, quantiles = 0.9) +
#            geom_smooth(formula = y~x, method = 'lm') +
#            theme(legend.position = 'none') +
#            annotate('text',x = quantile(log10(tmp3$invdist),0.9,na.rm=T), y = 6.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
#            ggtitle('Total Nitrogen vs OSS inverse distance (ft)') #)
# 
# cor = with(tmp3,cor.test(log10(invdist),log10(secchi)))
# p4 <- ggplot(tmp3,aes(log10(invdist),log(secchi))) +
#                  geom_point() +
#                  # geom_quantile(formula = y~x, quantiles = 0.9) +
#                  geom_smooth(formula = y~x, method = 'lm') +
#                  theme(legend.position = 'none') +
#                  annotate('text',x = quantile(log10(tmp3$invdist),0.9,na.rm=T), y = 1.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
#                  ggtitle('secchi vs OSS inverse distance (ft)') #)
# 
# (p1+p2)/(p3+p4)

I also tried placing each lake watershed into one of three categories: wholly within the regional sewer service area, straddling the boundary (fringe), and wholly outside of the regional sewer service area. The differences were consistent with the conclusions of Moore et al. (2003).

Multiple comparisons indicated lower trophic state for lakes wholly within and wholly outside of the regional sewer service area, with the highest trophic state for lakes on the fringe. This pattern makes some sense: lakes within the service area would have regional sewer service and relatively few septic systems; lakes outside the service area would be the least developed and have the fewest septic systems; and lakes on the fringe would have relatively high development and the largest number of septic systems. However, none of the apparent differences were statistically significant.

I also checked for systematic differences in lake depth among the three categories. Lake depths appear to be similarly distributed within each category. Lake depth seems like a variable that should be considered in any statistical evaluation of the spatial variability of trophic state.
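For a quick nonparametric cross-check of those category comparisons, base R's pairwise.wilcox.test() returns FDR-adjusted pairwise p-values directly. This is a minimal sketch, not part of the original analysis; it assumes the tmp2 data frame with the ordered Sewer_status factor built in the first chunk below.

# minimal sketch: pairwise Wilcoxon rank-sum tests across the three sewer
# categories with FDR adjustment; assumes tmp2/Sewer_status from the chunk below
with(tmp2, pairwise.wilcox.test(chla, Sewer_status, p.adjust.method = "fdr"))
with(tmp2, pairwise.wilcox.test(secchi, Sewer_status, p.adjust.method = "fdr"))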

code to perform exploratory data analysis for lake watershed septic density - alt
###############################################################################################################################
### different approach
###############################################################################################################################

library(rstatix)
library(forcats)
library(ggpubr)
library(coin)
# library(MKinfer)
library(rcompanion)
library(permuco)

kcsites <- read_csv('kcsites.csv') %>%  transmute(lake=Lake, Locator=SITE_ID, depth_max_ft = depth_max_ft,
             Sewer_status = factor(Sewer_status, ordered=T,levels = c("Sewered","Fringe","Unsewered")))
# kcsites$Sewer_status <-  fct_collapse(kcsites$Sewer_status, Sewered = c("Sewered","Fringe"))
tmp2 <- left_join(kc.lakes,kcsites,by="lake")

mod <- lm(log10(tn) ~ Sewer_status, data = tmp2)
aov.fit <- aov(mod) # named aov.fit to avoid masking stats::aov
summary(aov.fit)
             Df Sum Sq  Mean Sq F value Pr(>F)
Sewer_status  2 0.0105 0.005258   0.274  0.762
Residuals    39 0.7486 0.019194               
code to perform exploratory data analysis for lake watershed septic density - alt
TukeyHSD(aov.fit)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = mod)

$Sewer_status
                          diff        lwr       upr     p adj
Fringe-Sewered     0.043053091 -0.1003719 0.1864780 0.7465239
Unsewered-Sewered  0.008226729 -0.1077478 0.1242012 0.9836761
Unsewered-Fringe  -0.034826362 -0.1809835 0.1113307 0.8312814
code to perform exploratory data analysis for lake watershed septic density - alt
# par(mar=c(3,8,3,3))
par(mar=c(5,7,4,1)+.1)
plot(TukeyHSD(aov.fit),las=1,tcl = -.3)

code to perform exploratory data analysis for lake watershed septic density - alt
sample_size <- tmp2 %>% group_by(Sewer_status) %>% summarize(num=n())
tmp2 <- left_join(tmp2, sample_size, by="Sewer_status") %>%  mutate(Sewer_status = paste0(Sewer_status, "\n", "n=", num)) %>% 
   # hard-coded: convert the new "Sewer_status\nn=" labels to an ordered factor
    mutate(Sewer_status = factor(Sewer_status,ordered=T,levels = c("Sewered\nn=18","Fringe\nn=8","Unsewered\nn=16")))

# note: with an ordered factor, coin::kruskal_test performs a linear-by-linear
# association test (see output below) rather than the classic Kruskal-Wallis test
res.kruskal <- coin::kruskal_test(chla ~ Sewer_status, data = tmp2)
res.kruskal

    Asymptotic Linear-by-Linear Association Test

data:  chla by
     Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = 1.0639, p-value = 0.2874
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
ggplot(tmp2, aes(x=Sewer_status,y=chla)) +
  theme_bw() +
  geom_boxplot() +
  labs(x="", y="Chlorophyll a (µg/L)") +
  ggtitle("Regional sewer status and Chlorophll a") +
  stat_compare_means()

code to perform exploratory data analysis for lake watershed septic density - alt
res.kruskal <- coin::kruskal_test(tp ~ Sewer_status, data = tmp2)
res.kruskal

    Asymptotic Linear-by-Linear Association Test

data:  tp by
     Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = -0.22399, p-value = 0.8228
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
ggplot(tmp2,aes(x=Sewer_status,y=tp)) +
  theme_bw() +
  geom_boxplot() +
  labs(x="", y="Total Phosphorus (µg/L)") +
  ggtitle("Regional sewer status and Total Phosphorus") +
  stat_compare_means()

code to perform exploratory data analysis for lake watershed septic density - alt
res.kruskal <- coin::kruskal_test(tn ~ Sewer_status, data = tmp2)
res.kruskal

    Asymptotic Linear-by-Linear Association Test

data:  tn by
     Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = 0.88195, p-value = 0.3778
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
ggplot(tmp2,aes(x=Sewer_status,y=tn)) +
  theme_bw() +
  geom_boxplot() +
  labs(x="", y="Total Nitrogen (µg/L)") +
  ggtitle("Regional sewer status and Total Nitrogen") +
  stat_compare_means()

code to perform exploratory data analysis for lake watershed septic density - alt
res.kruskal <- coin::kruskal_test(secchi ~ Sewer_status, data = tmp2)
res.kruskal

    Asymptotic Linear-by-Linear Association Test

data:  secchi by
     Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = -1.0219, p-value = 0.3068
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
ggplot(tmp2,aes(x=Sewer_status,y=secchi)) +
  theme_bw() +
  geom_boxplot() +
  labs(x="", y="Secchi transparency (m)") +
  ggtitle("Regional sewer status and Secchi transparency")  +
  stat_compare_means()

code to perform exploratory data analysis for lake watershed septic density - alt
ggplot(tmp2, aes(x=Sewer_status,y=depth_max_ft)) +
  theme_bw() +
  geom_boxplot() +
  labs(x="", y="Maximum depth (ft)") +
  ggtitle("Regional sewer status and lake depth") +
  stat_compare_means()

code to perform exploratory data analysis for lake watershed septic density - alt
# permutation tests
# https://rcompanion.org/rcompanion/d_06a.html
independence_test(data = tmp2, chla ~ Sewer_status)

    Asymptotic General Independence Test

data:  chla by
     Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = 0.90362, p-value = 0.3662
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
independence_test(data = tmp2, tp ~ Sewer_status)

    Asymptotic General Independence Test

data:  tp by
     Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = -0.17386, p-value = 0.862
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
independence_test(data = tmp2, tn ~ Sewer_status)

    Asymptotic General Independence Test

data:  tn by
     Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = -0.051213, p-value = 0.9592
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
independence_test(data = tmp2, secchi ~ Sewer_status)

    Asymptotic General Independence Test

data:  secchi by
     Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = -1.3118, p-value = 0.1896
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
independence_test(data = tmp2, depth_max_ft ~ Sewer_status)

    Asymptotic General Independence Test

data:  depth_max_ft by
     Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = -0.43215, p-value = 0.6656
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
# multiple comparison permutation test
pairwisePermutationTest(data = tmp2, secchi ~ Sewer_status, method = 'fdr')
                           Comparison   Stat p.value p.adjust
1     Sewered\nn=18 - Fringe\nn=8 = 0 -1.868 0.06177   0.1853
2 Sewered\nn=18 - Unsewered\nn=16 = 0 -1.247  0.2125   0.2417
3   Fringe\nn=8 - Unsewered\nn=16 = 0  1.171  0.2417   0.2417
code to perform exploratory data analysis for lake watershed septic density - alt
PT = pairwisePermutationTest(data = tmp2, secchi ~ Sewer_status, method = 'fdr')
cldList(p.adjust ~ Comparison,
        data = PT,
        threshold  = 0.05)
           Group Letter MonoLetter
1   Sewered\nn18      a          a
2     Fringe\nn8      a          a
3 Unsewered\nn16      a          a
code to perform exploratory data analysis for lake watershed septic density - alt
# ancova to control for lake depth in multiple comparisons
# https://www.datanovia.com/en/lessons/ancova-in-r/
res.aov <- tmp2 %>% anova_test(secchi ~ depth_max_ft + Sewer_status)
get_anova_table(res.aov)
ANOVA Table (type II tests)

        Effect DFn DFd      F        p p<.05   ges
1 depth_max_ft   1  36 22.935 2.86e-05     * 0.389
2 Sewer_status   2  36  1.264 2.95e-01       0.066
code to perform exploratory data analysis for lake watershed septic density - alt
# permutation anova (one-way here; the ancova with depth follows below) (permuco)
res.pavo <- aovperm(secchi ~ Sewer_status, data = tmp2, np = 100000)
res.pavo
Anova Table
Resampling test using freedman_lane to handle nuisance variables and 1e+05 permutations.
                 SS df     F parametric P(>F) resampled P(>F)
Sewer_status  13.06  2 2.346           0.1091           0.108
Residuals    108.52 39                                       
code to perform exploratory data analysis for lake watershed septic density - alt
res.pavo <- aovperm(secchi ~ depth_max_ft*Sewer_status, data = tmp2, np = 100000)
res.pavo
Anova Table
Resampling test using freedman_lane to handle nuisance variables and 1e+05 permutations.
                              SS df      F parametric P(>F) resampled P(>F)
depth_max_ft              15.800  1 8.9246         0.005192         0.00395
Sewer_status               0.686  2 0.1937         0.824778         0.82694
depth_max_ft:Sewer_status  3.145  2 0.8883         0.420700         0.42542
Residuals                 60.192 34                                        

Addendum - A quick look at the relationship between lake trophic state and maximum lake depth

After reviewing one of the suggested readings above (Read et al. 2015), I found that lake depth might explain most of the variability in observed lake trophic state. Here I use just the King County lake data. Fortunately, I found a source of King County (and Snohomish County) lake physical data that I suspect has been used in the past to populate webpage and report tables - Bortleson et al. (1976). The Bortleson report also contains a treasure trove of lake water quality observations that might be of some use.

code to perform exploratory data analysis for maximum lake depth
#### let's look @ some lake physical characteristics
lkphys <- read_csv("C:/Users/degaspec/OneDrive - King County/R/NLA/kcsites.csv") %>% rename(lake = Lake)

tmp2 <- left_join(tmp2,lkphys,by='lake')

cor = with(tmp2,cor.test(MaximumDepth,log10(chla)))
p1 <- ggplot(tmp2,aes(MaximumDepth,log(chla))) +
           geom_point() +
           # geom_quantile(formula = y~x, quantiles = 0.9) +
           geom_smooth(formula = y~x, method = 'lm') +
           theme(legend.position = 'none') +
           annotate('text',x = quantile(tmp2$MaximumDepth,0.9,na.rm=T), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
           ggtitle('Chlorophyll a vs Maximum Lake Depth')

cor = with(tmp2,cor.test(MaximumDepth,log10(tp)))
p2 <- ggplot(tmp2,aes(MaximumDepth,log(tp))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text',x = quantile(tmp2$MaximumDepth,0.9,na.rm=T), y = 3.5, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
  ggtitle('Total Phosphorus vs Maximum Lake Depth')

cor = with(tmp2,cor.test(MaximumDepth,log10(tn)))
p3 <- ggplot(tmp2,aes(MaximumDepth,log(tn))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text',x = quantile(tmp2$MaximumDepth,0.9,na.rm=T), y = 6.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
  ggtitle('Total Nitrogen vs Maximum Lake Depth')

cor = with(tmp2,cor.test(MaximumDepth,log10(secchi)))
p4 <- ggplot(tmp2,aes(MaximumDepth,log(secchi))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text',x = quantile(tmp2$MaximumDepth,0.9,na.rm=T), y = 2.25, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
  ggtitle('Secchi Depth vs Maximum Lake Depth')

(p1+p2)/(p3+p4)

I requested and received maximum lake depth data (actually index site sample depth, but similar enough, I think, since the index site targets the deepest part of each lake) for the National Lakes Assessment lakes. I found a similar correlation between index site sample depth and trophic state indicators.
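Where a published maximum depth also exists (e.g., in Bortleson et al. 1976), the index-depth-as-maximum-depth assumption could be spot-checked directly. A minimal sketch, where pub_depths and its pub_max_depth_m column are hypothetical placeholders:

# minimal sketch: compare NLA index site sample depth against a published
# maximum depth; pub_depths and pub_max_depth_m are hypothetical placeholders
# chk <- left_join(nla2, pub_depths, by = "lake")
# with(chk, cor.test(MaximumDepth, pub_max_depth_m))
# with(chk, summary(pub_max_depth_m - MaximumDepth)) # index depth should be <= max depth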

code to perform exploratory data analysis for NLA index site depth
#### let's look @ some NLA lake physical characteristics
nla.lkphys <- read_csv("C:/Users/degaspec/OneDrive - King County/R/NLA/data/NLA_WA_2007-2022_lakes_depth_area.csv") 

# The NLA data files do not include lake names so I had to first filter the national data to lakes in the Puget Lowland and then identified lake names using Google Maps
nlasites <- read_csv('sites_lake_names.csv') %>% rename(lake = Lake)

nla2 <- left_join(nlasites,nla.lkphys,by="SITE_ID") %>% mutate(INDEX_SITE_DEPTH = as.numeric(INDEX_SITE_DEPTH)) %>% group_by(lake) %>% summarize(AREA_HA = mean(AREA_HA,na.rm=TRUE), PERIM_KM = mean(PERIM_KM,na.rm=TRUE), MaximumDepth = mean(INDEX_SITE_DEPTH,na.rm=TRUE), WSAREASQKM = mean(WSAREASQKM,na.rm=TRUE)) %>% ungroup() %>% mutate(Type = "NLA")

# tmp3 <- left_join(nla,df,by='lake')
tmp3 <- left_join(nla2,nla,by='lake')

cor = with(tmp3,cor.test(MaximumDepth,log10(chla)))
p1 <- ggplot(tmp3,aes(MaximumDepth,log(chla))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text',x = quantile(tmp3$MaximumDepth,0.9,na.rm=T), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
  ggtitle('Chlorophyll a vs Index Site Depth')

cor = with(tmp3,cor.test(MaximumDepth,log10(tp)))
p2 <- ggplot(tmp3,aes(MaximumDepth,log(tp))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text',x = quantile(tmp3$MaximumDepth,0.9,na.rm=T), y = 3.5, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,7),sep=" ")) +
  ggtitle('Total Phosphorus vs Index Site Depth')

cor = with(tmp3,cor.test(MaximumDepth,log10(tn)))
p3 <- ggplot(tmp3,aes(MaximumDepth,log(tn))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text',x = quantile(tmp3$MaximumDepth,0.9,na.rm=T), y = 6.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
  ggtitle('Total Nitrogen vs Index Site Depth')

cor = with(tmp3,cor.test(MaximumDepth,log10(secchi)))
p4 <- ggplot(tmp3,aes(MaximumDepth,log(secchi))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text',x = quantile(tmp3$MaximumDepth,0.9,na.rm=T), y = 2.25, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
  ggtitle('Secchi Depth vs Index Site Depth')

(p1+p2)/(p3+p4)

I combined the two data sets to see what that might look like. The relationships are similar, with somewhat different slopes, with the possible exception of chlorophyll a; a quick test of whether the slopes actually differ is sketched after the chunk below. It also looks like the NLA lakes include lakes that are shallower than some King County lakes, but a number of King County lakes are deeper than any of the NLA lakes in the Puget Lowland.

code to perform exploratory data analysis for combined lake depth data
tmp4 <- bind_rows(tmp2,tmp3)

cor = with(tmp4,cor.test(MaximumDepth,log10(chla)))
p1 <- ggplot(tmp4,aes(MaximumDepth,log(chla),color=Type)) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text',x = quantile(tmp4$MaximumDepth,0.9,na.rm=T), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
  ggtitle('Chlorophyll a vs Index Site Depth')

cor = with(tmp4,cor.test(MaximumDepth,log10(tp)))
p2 <- ggplot(tmp4,aes(MaximumDepth,log(tp),color=Type)) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text',x = quantile(tmp4$MaximumDepth,0.9,na.rm=T), y = 3.5, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,7),sep=" ")) +
  ggtitle('Total Phosphorus vs Index Site Depth')

cor = with(tmp4,cor.test(MaximumDepth,log10(tn)))
p3 <- ggplot(tmp4,aes(MaximumDepth,log(tn),color=Type)) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text',x = quantile(tmp4$MaximumDepth,0.9,na.rm=T), y = 6.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
  ggtitle('Total Nitrogen vs Index Site Depth')

cor = with(tmp4,cor.test(MaximumDepth,log10(secchi)))
p4 <- ggplot(tmp4,aes(MaximumDepth,log(secchi),color=Type)) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  # theme(legend.position = 'none') +
  annotate('text',x = quantile(tmp4$MaximumDepth,0.9,na.rm=T), y = 2.25, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,5),sep=" ")) +
  ggtitle('Secchi Depth vs Index Site Depth')

(p1+p2)/(p3+p4)
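As promised above, a quick way to formalize the "somewhat different slopes" impression is an interaction term on the combined data. This is a minimal sketch, assuming the tmp4 data frame and the Type column used for color in the chunk above; repeating it for tp, tn, and secchi would complete the check.

# minimal sketch: a significant MaximumDepth:Type interaction would indicate
# that the depth slope differs between the King Co and NLA lakes
mod.int <- lm(log10(chla) ~ MaximumDepth * Type, data = tmp4)
anova(mod.int) # the interaction row is the slope-difference test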

And now a preliminary look at some similar analyses using data collected by the USGS on many, many lakes in the Puget Sound basin in the early 1970s…starting with data for King and Snohomish County lakes.

As I noted above, Bortleson et al. (1976) contains a treasure trove of data, ranging from watershed land use and lake physical characteristics to water quality data (single summer samples from the early 1970s). The water quality data include TP, TN (derived by summing N forms), Secchi depth, and Color (PCU), but sadly no chlorophyll a measurements.
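Deriving TN by summing the reported N forms is straightforward. A minimal sketch, where the data frame and column names (all expressed as N) are hypothetical placeholders for the Bortleson fields:

# minimal sketch: TN as the sum of the reported N forms; bortleson and the
# column names (no3_no2_n, nh3_n, org_n) are hypothetical placeholders
# bortleson <- bortleson %>% mutate(tn = no3_no2_n + nh3_n + org_n)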

It took a few days to enter the data for 362 lakes. I filtered these to lakes in the Puget Lowland and excluded lakes that weren't small lakes (e.g., Lake Union and Portage Bay in King County), were saline, were mostly covered by emergent vegetation (really wetlands), or were reservoirs on a larger river system. This screening resulted in 226 Puget Lowland lakes.

I conducted a separate analysis of these data that can be found here.

Please take a look, but what I found from that analysis was that the variables that best explained summer lake surface TP, in order of importance (most to least), were water color, maximum lake depth, and %Agriculture. The variables that best explained summer lake Secchi depth, in order of importance, were water color and maximum lake depth.
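Boiled down to single models, that analysis has roughly the flavor of the sketch below; the data frame and column names are hypothetical placeholders, and the linked analysis documents the actual variable selection.

# minimal sketch of the models behind the importance rankings above; bortleson,
# color_pcu, depth_max_ft, and pct_ag are hypothetical placeholders
# mod.tp <- lm(log10(tp) ~ color_pcu + depth_max_ft + pct_ag, data = bortleson)
# mod.sd <- lm(log10(secchi) ~ color_pcu + depth_max_ft, data = bortleson)
# summary(mod.tp); summary(mod.sd)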