Over 20 years ago, Nora Kammer (2004) reported on analyses of King County Small Lakes data to address three hypotheses:
As the proportion of high density land uses in lake watersheds increases, lakes will become more eutrophic.
As the proportion of forested land uses in lake watersheds increases, lakes will become less eutrophic.
Septic systems within lake watersheds will cause lakes to become more eutrophic.
Although not explicitly stated, Nora’s approach is what is often described as a space-for-time approach: she looked for correlations between independent variables (e.g., watershed %forest) and dependent variables (the trophic state indicators chlorophyll a, total phosphorus, and Secchi depth) at specific points in time.
The space-for-time approach uses spatial gradients of independent variables as a proxy for temporal changes in dependent variables. This approach is typically used because data were not collected at one or more individual locations over a period of time covering significant changes in the independent variables of interest.
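To make that concrete, here is a minimal sketch of a space-for-time test; the lakes data frame and its values are made up for illustration:

# hypothetical example: one row per lake, all observed at about the same time
lakes <- data.frame(
  pct_forest = c(90, 75, 60, 40, 20, 5),          # independent variable (spatial gradient)
  chla       = c(2.1, 3.4, 4.0, 6.5, 9.8, 12.3))  # dependent variable (trophic state indicator)
# the correlation across space stands in for the change over time we didn't observe
with(lakes, cor.test(pct_forest, log10(chla)))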
Nora’s results were quite striking. High density development in a lake’s watershed was negatively correlated with lake trophic state indicators (higher density development ~ lower trophic state), and forest cover was positively correlated with trophic state indicators (higher forest cover ~ higher trophic state). A less counterintuitive finding was that lakes with sewered watersheds (i.e., few or no septic systems) tended to be less eutrophic.
Methods
I started by pulling the King Co Lake Stewardship data from our SQL database. I also downloaded National Lakes Assessment (NLA) data (here) as well as data from Western Washington University’s Institute for Watershed Studies (IWS) (here). The IWS samples a number of small lakes primarily in the northeast portion of the Puget Sound basin. I also requested and received lake data from Snohomish County.
I removed some lakes from the data set because they were rather large (Lake Sammamish) or highly altered or managed (e.g., alum treatments); the latter included Capitol, Ketchum, and Stevens. I also removed some redundant data, preferring Snohomish County data over data provided by NLA or IWS (e.g., Goodwin, Armstrong, and Sunday).
code to load and process the lake data
library(tidyverse)
library(lubridate)
library(readxl)
library(janitor)
library(RODBC)
library(sf)
library(conflicted)
conflict_prefer("filter", "dplyr", quiet = T)
conflict_prefer("select", "dplyr", quiet = T)

##########################################################
### Note that I haven't attempted to deal with nondetects
### nor have I assessed their frequency. I suspect nondetects
### for chlorophyll a, tp, and tn would be infrequent in lake
### surface waters during summer
### I also haven't tried to deal with secchi measurements
### that extend to the lake bottom. There are some in the
### NLA dataset...perhaps some in the King Co and IWS data?
##########################################################

##########################################################
### get King Co data
##########################################################
ch <- odbcConnect("Small lakes")
kc <- sqlQuery(ch, paste("SELECT Site.ID, Site.SiteName, Site.Locator, WQData.SiteID, WQData.CollectDate, WQData.Depth, WQData.DepthQ, WQData.ChlorophyllA, WQData.ChloroQ, WQData.TotalPhosphorus, WQData.TotalPQ, WQData.TotalNitrogen, WQData.TotalNQ, WQData.secchi, WQData.secchiQ, WQData.UV254, WQData.UV254Q FROM WQData INNER JOIN Site ON WQData.SiteID = Site.ID"))
odbcClose(ch)

kc <- kc %>%
  filter(!SiteName %in% c('Hicklin','North','Steel')) %>%
  transmute(lake = SiteName, Locator = Locator, type = "King Co",
            year = year(as.Date(CollectDate)),
            month = month(as.Date(CollectDate)),
            day = day(as.Date(CollectDate)),
            depth = Depth, chla = ChlorophyllA, tp = TotalPhosphorus,
            tn = TotalNitrogen, secchi = secchi, tn_tp = tn/tp)

# add Lat/Lon - some of these were adjusted so they would occur on the lake and
# intersect with National Hydrography Data (NHD) waterbodies and so they would
# work with the StreamCat package that extracts the LakeCat data
kcsites <- read_csv('kcsites.csv') %>%
  transmute(lake = Lake, Locator = SITE_ID, lat = LAT_DD83, lon = LON_DD83)
kc <- left_join(kc, kcsites, by = c("lake","Locator")) %>%
  select(-Locator) %>%
  arrange(lake)

##########################################################
### get National Lakes Assessment (NLA) data
### add dummy month = 8 NLA lakes sampled July-September
### a bit tricky because they sometimes enter different lat/lon for each sampling event
##########################################################
nla <- read_csv('./data/nla2007-2022_data_forpopestimates_indexvisits_probsites_0.csv') %>%
  filter(PSTL_CODE == 'WA') %>%
  transmute(SITE_ID = SITE_ID, type = 'NLA', year = DSGN_CYCLE, month = 8, day = 15,
            depth = 1, chla = CHLA_RESULT, tn = NTL_RESULT, tp = PTL_RESULT,
            tn_tp = tn/tp, lat = LAT_DD83, lon = LON_DD83)

### secchi data is in separate files for some reason
secchi07 <- read_csv('data/2007/NLA2007_secchi_20091008.csv') %>%
  select(SITE_ID, SECMEAN) %>%
  inner_join(., nla[c('SITE_ID')], by = 'SITE_ID') %>%
  rename(SECCHI = SECMEAN) %>%
  group_by(SITE_ID) %>%
  summarize(SECCHI = mean(SECCHI)) %>%
  ungroup()
# secchi07 <- inner_join(secchi07,sites[c('SITE_ID','PSTL_CODE')],by='SITE_ID')

secchi12 <- read_csv('data/2012/nla2012_secchi_08232016.csv') %>%
  select(SITE_ID, SECCHI) %>%
  inner_join(., nla[c('SITE_ID')], by = 'SITE_ID') %>%
  group_by(SITE_ID) %>%
  summarize(SECCHI = mean(SECCHI)) %>%
  ungroup()

secchi17 <- read_csv('data/2017/nla_2017_secchi-data.csv') %>%
  select(SITE_ID, CLEAR_TO_BOTTOM, INDEX_SITE_DEPTH, DISAPPEARS, REAPPEARS) %>%
  inner_join(., nla[c('SITE_ID')], by = 'SITE_ID') %>%
  mutate(SECCHI = (DISAPPEARS + REAPPEARS)/2) %>%
  select(-DISAPPEARS, -REAPPEARS)

secchi22 <- read_csv('data/2022/nla22_secchi.csv') %>%
  select(SITE_ID, CLEAR_TO_BOTTOM, INDEX_SITE_DEPTH, DISAPPEARS, REAPPEARS) %>%
  inner_join(., nla[c('SITE_ID')], by = 'SITE_ID') %>%
  mutate(SECCHI = (DISAPPEARS + REAPPEARS)/2) %>%
  select(-DISAPPEARS, -REAPPEARS) %>%
  group_by(SITE_ID, CLEAR_TO_BOTTOM) %>%
  summarize(SECCHI = mean(SECCHI), INDEX_SITE_DEPTH = mean(INDEX_SITE_DEPTH)) %>%
  ungroup()

secchi.nla <- bind_rows(list(secchi07, secchi12, secchi17, secchi22))
nla <- left_join(nla, secchi.nla, by = "SITE_ID")

# The NLA data files do not include lake names so I had to first filter the national
# data to lakes in the Puget Lowland and then identify lake names using Google Maps
nlasites <- read_csv('sites_lake_names.csv') %>%
  rename(lake = Lake)
nla <- left_join(nlasites, nla, by = "SITE_ID") %>%
  rename(secchi = SECCHI) %>%
  select(-SITE_ID, -CLEAR_TO_BOTTOM, -INDEX_SITE_DEPTH)

# pull out lat/lon and rejoin unique set for each lake
nla_ll <- nla %>%
  select(lake, lat, lon) %>%
  distinct(lake, .keep_all = TRUE)
nla <- select(nla, -lat, -lon)
nla <- left_join(nla, nla_ll, by = 'lake')

# I know that one lat/lon (Sunwood Lake) needs to be fixed based on previous experience
nla$lon[39] <- -122.774407
nla$lat[39] <- 46.970290

### remove some lakes that are redundant with lakes sampled by Snohomish Co. and remove
### Capitol Lake since it is a unicorn...heavily modified, tidal, enormous watershed (all Deschutes River)
### and Sammamish since it isn't a "small" lake
nla <- filter(nla, !lake %in% c("Capitol Lake","Lake Goodwin","Lake Armstrong","Lake Sammamish"))

##########################################################
### get Western Washington University Institute for Watershed Studies (IWS) data
### note: no secchi data :-(
##########################################################
iws <- read_csv('data/IWSwaterqualitydata-all.csv', na = "NA")

# I fixed some of these coordinates to work with StreamCat
iws_lldd <- read_csv('data/lake_coordinates_all_mod.csv') %>%
  rename(key = code) %>%
  select(-mod) # %>% rename(lake = Lake)
iws <- left_join(iws, iws_lldd, by = "key")

### beware....the line below removes the less than detection symbol ("<")
iws[] <- lapply(iws, gsub, pattern = '<', replacement = '')

# convert text values to numeric
iws <- iws %>%
  mutate_at(c(4:19, 22:23), as.numeric) %>%
  select(lake, year, month, day, chl, tn, tp, lat, long) %>%
  rename(chla = chl, lon = long) %>%
  mutate(type = 'IWS', depth = 1, tn_tp = tn/tp)

# I also know Hoag Pond does not have associated data in LakeCat
iws <- filter(iws, !lake %in% c('Hoag Pond'))

### remove lakes that are redundant with Snohomish Co. and Ketchum which has a treatment history
iws <- filter(iws, !lake %in% c("Armstrong","Sunday","Ketchum"))

##########################################################
### get Snohomish County data
### provided separately for each parameter, so I will handle this a bit differently
### not calculating tn:tp yet
##########################################################
sno_locs <- read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/data/Snohomish County Lake Data-v2.xlsx", 'SnoCo Locations') %>%
  clean_names() %>%
  # mutate(lake_name = str_to_title(lake_name))
  mutate(lake = str_to_title(lake_name),
         lake = ifelse(lake == 'Little Martha', 'Little_martha', lake)) %>%
  select(lake, everything()) %>%
  select(-lake_name) %>%
  rename(lat = latitude, lon = longitude)

sno_chla <- read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/data/Snohomish County Lake Data-v2.xlsx", 'Chla') %>%
  clean_names() %>%
  rename(lake_name = station_number) %>%
  transmute(lake = str_to_title(lake_name), date = as.Date(date),
            year = year(date), month = month(date), day = day(date),
            depth = `depth_m`, layer = layer,
            Parm = 'chla', value = chla_result_ug_l,
            qualifier = qualifier_9, comment = qualifier_comment_10)

sno_tp <- read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/data/Snohomish County Lake Data-v2.xlsx", 'TP') %>%
  clean_names() %>%
  rename(lake_name = station_number) %>%
  transmute(lake = str_to_title(lake_name), date = as.Date(date),
            year = year(date), month = month(date), day = day(date),
            depth = `depth_m`, layer = layer,
            Parm = "tp", value = parameter_method_value_tp_sm4500pf*1000,
            qualifier = qualifier_9, comment = qualifier_comment_10)

sno_tn <- read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/data/Snohomish County Lake Data-v2.xlsx", 'TN') %>%
  clean_names() %>%
  rename(lake_name = station_number) %>%
  transmute(lake = str_to_title(lake_name), date = as.Date(date),
            year = year(date), month = month(date), day = day(date),
            depth = `depth_m`, layer = layer,
            Parm = "tn", value = tpn_result_mg_l*1000,
            qualifier = qualifier, comment = qualifier_comment)

sno_secchi <- read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/data/Snohomish County Lake Data-v2.xlsx", 'Secchi') %>%
  clean_names() %>%
  rename(lake_name = station) %>%
  transmute(lake = str_to_title(lake_name), date = as.Date(date),
            year = year(date), month = month(date), day = day(date),
            depth = 1, layer = 'Epilimnion',
            Parm = "secchi", value = secchi_depth_m,
            qualifier = qualifier, comment = qualifier_comment)

sno_color <- read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/data/Snohomish County Lake Data-v2.xlsx", 'TrueColor') %>%
  clean_names() %>%
  rename(lake_name = station_number) %>%
  transmute(lake = str_to_title(lake_name), date = as.Date(date),
            year = year(date), month = month(date), day = day(date),
            depth = `depth_m`, layer = layer,
            Parm = "color", value = color_value_pcu,
            qualifier = qualifier, comment = qualifier_comment)

sno_wq <- bind_rows(sno_chla, sno_tp, sno_tn, sno_secchi, sno_color) %>%
  mutate(type = 'Snohomish')
sno_wq <- left_join(sno_wq, sno_locs, by = "lake") %>%
  filter(!is.na(lat)) %>%
  mutate(type = 'Snohomish')

### remove Martha_s
sno_wq <- filter(sno_wq, !lake == "Martha_s")

sno.js <- sno_wq %>%
  filter(layer == 'Epilimnion' & depth <= 2) %>%
  group_by(lake, lat, lon, type, year, month, day, Parm) %>%
  summarize(value = mean(value, na.rm = T), n = n()) %>%
  ungroup() %>%
  filter(between(month, 7, 9) & between(year, 2020, 2024)) %>%
  group_by(lake, lat, lon, type, year, Parm) %>%
  summarize(value = mean(value, na.rm = T), n = n()) %>%
  ungroup() %>%
  filter(n >= 2) %>%
  group_by(lake) %>%
  mutate(ny = n()) %>%
  filter(ny > 2) %>%
  group_by(lake, lat, lon, type, Parm) %>%
  summarize(value = mean(value, na.rm = T)) %>%
  ungroup() %>%
  filter(!lake %in% c('Stevens','Ketchum','Beecher'))

# ggplot(sno_wq_long,aes(x=Parm,y=value)) +
#   geom_boxplot()
#
# tmp = filter(sno_wq_long,lake=='Armstrong'&Parm=='chla')
#
# # p <- ggplot(tmp, aes(date,value)) +
# ggplot(tmp, aes(year,value)) +
#   theme_bw() +
#   geom_point() +
#   geom_smooth(se = FALSE) +
#   # expand_limits(y=0) +
#   ggtitle(tmp$lake[1])

###########################################################################
### now combine data and calculate summer (Jul-Sep) mean values
###########################################################################
df <- bind_rows(list(kc, nla, iws))
nms <- names(df)
df.long <- gather(df, -nms[c(1:6, 12:13)], key = Parm, value = value)

# I'm going to play a bit loose with calculating means
# I'll use all of the NLA data (2007-2022)
# I'll use King Co data 2020-2024
# I'll use IWS data 2019-2023 with at least 4 years of data (No IWS data for 2024 at this time...but hopefully early next year?)
nla.js <- df.long %>%
  filter(type == 'NLA') %>%
  group_by(lake, type, lat, lon, Parm) %>%
  summarize(value = mean(value, na.rm = T)) %>%
  ungroup()

kc.js <- df.long %>%
  filter(type == "King Co" & between(year, 2020, 2024) & between(month, 7, 9) & depth == 1) %>%
  group_by(lake, type, lat, lon, Parm) %>%
  summarize(value = mean(value, na.rm = T)) %>%
  ungroup()

iws.js <- df.long %>%
  filter(type == 'IWS' & between(year, 2019, 2023) & between(month, 7, 9)) %>%
  group_by(lake, type, lat, lon, Parm) %>%
  summarize(value = mean(value, na.rm = T), n = n()) %>%
  ungroup() %>%
  filter(n > 3) %>%
  select(-n)

# df.long <- bind_rows(list(nla.js,kc.js,iws.js))
df.long <- bind_rows(list(nla.js, kc.js, iws.js, sno.js))

# now spread
df <- spread(df.long, key = Parm, value = value)

# now select only lakes with lat/lon within Puget Lowlands
### get PL ecoregion boundary....
ecoregion <- st_read("C:/Users/degaspec/OneDrive - King County/gis_external/EPA_Region10_ecoregions/reg10_eco_l3.shp", quiet = TRUE) %>%
  st_transform(4269)
# st_crs(ecoregion)
pugetl <- ecoregion %>%
  filter(US_L3NAME == "Puget Lowland") %>%
  dplyr::select(US_L3NAME) %>%
  st_transform(4269)

### intersect puget lowland shape with sampling points
df <- st_as_sf(df, coords = c("lon", "lat"), remove = FALSE)
st_crs(df) <- 4269
df <- st_intersection(df, pugetl) %>%
  st_drop_geometry() %>%
  dplyr::select(-US_L3NAME)

df <- mutate(df, lake = str_remove_all(lake, "Lake | Lake")) %>%
  arrange(lake)
tmp <- unique(df[c("lake","type")])
Currently, I have total phosphorus and chlorophyll a data for 115 and 114 lakes, respectively, across the Puget Lowland of the Puget Sound basin. I have Secchi depth data for only 98 lakes because the IWS lake monitoring effort does not measure Secchi depth. To illustrate the coverage, I’ll provide a quick map of mean summer total phosphorus concentrations.
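The map itself comes from a folded chunk; a quick sketch of how it could be drawn with mapview (the same call pattern appears in commented code later in this post), assuming the df and pugetl objects created above:

library(mapview)
mapview(dplyr::filter(df, !is.na(tp)), zcol = 'tp', xcol = "lon", ycol = "lat",
        crs = 4269, grid = FALSE, layer.name = 'Summer TP (µg/L)') +
  mapview(pugetl, alpha.regions = 0.2)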
Now I’ll pull in some LakeCat watershed data (here) and join it to the lake data. It turns out that StreamCatTools does not return any watershed data for Josephine Lake (NLA), although the COMID appears to be valid. Joining the lake water quality data with the land cover data leaves us with 114 lakes with total phosphorus/land cover data, 113 with chlorophyll a/land cover, and 97 with Secchi/land cover. Note that Pipe and Lucerne share the same COMID, so they have the same watershed data. I suppose I could choose to drop one of those lakes, but for now enough of being a data monkey!
I’ve also since learned that lake-specific characteristics, particularly lake depth, explain much of the variance in lake trophic state indicators (Read et al. 2015). Therefore, I also added maximum lake depth to this data set as a potential explanatory variable.
code to load LakeCat data
library(StreamCatTools)
library(lakemorpho)
library(elevatr)

dd <- data.frame(x = df$lon, y = df$lat) |>
  sf::st_as_sf(coords = c('x', 'y'), crs = 4326)

# comids <- scan(textConnection(comids), sep=",")
# count.fields(textConnection(comids), sep = ",")
# for (i in 1:92){
#   print(i)
#   comids <- lc_get_comid(dd[i,])
# }
comids <- lc_get_comid(dd)
comids2 <- data.frame(COMID = scan(textConnection(comids), sep = ","))
df <- bind_cols(df, comids2)

#### run lakemorpho on all lakes using NHDWaterbody.shp
# ri_lakes <- st_read("./data/ri_lakes/ri_lakes.shp")
# lks <- st_read("C:/Users/degaspec/OneDrive - King County/gis_external/NHDPlusPN/NHDPlus17/NHDSnapshot/Hydrography/NHDWaterbody.shp") %>%
#   # filter(COMID %in% c(24284092,24290072,23993571,24536772)) %>%
#   st_transform(st_crs(ri_lakes))
# lkmorpho <- NULL
# # for (comid in df$COMID){
# for (i in 1:nrow(df)){
#   # lk <- filter(lks,COMID==comid)
#   lk <- filter(lks,COMID==df$COMID[i])
#   dem <- get_elev_raster(lk,z=12,expand=1000,src='aws')*3.281
#   lmorph <- lakeSurroundTopo(inLake = st_zm(lk),inElev = dem)
#   lmetric <- calcLakeMetrics(lmorph,bearing=270,pointDens = 100)
#   # dum <- data.frame(NAME = lk$GNIS_NAME,lmetric)
#   dum <- data.frame(lake = df$lake[i], type = df$type[i], lat = df$lat[i], lon = df$lon[i],
#                     lmetric)
#   lkmorpho <- bind_rows(lkmorpho,dum)
# }
# rm(dum,dem,lmorph,lmetric,lk)
# save(lkmorpho,file = 'lkmorpho.RData')
load('lkmorpho.RData')

###############################################################################################################
### select and import a bunch of metrics we can join to our King Co. and NLA lake water quality data and do
### some statistical modeling of say summer chlorophyll a
### start with just watershed and not add 'catchment', 'other' [whatever that is], 'riparian catchment', or
### 'riparian watershed' at least for now
# region_params <- lc_get_params(param='areaOfInterest')
# name_params <- lc_get_params(param='name')
# print(paste0('region parameters are: ', paste(region_params,collapse = ', ')))
#> [1] "region parameters are: catchment, other, riparian_catchment, riparian_watershed, watershed"
# print(paste0('A selection of available LakeCat metrics include: ',paste(name_params[1:10],collapse = ', ')))
#> [1] "A selection of available LakeCat metrics include: agkffact, al2o3, bfi, canaldens, cao, cbnf, clay, coalminedens, compstrgth, damdens"
###############################################################################################################
### I've already filtered the list down to this...
fnms <- read_csv('fullname_params_use.csv') %>%
  filter(use == 1)
lc <- lc_get_data(metric = paste(fnms$name, collapse = ","), aoi = 'watershed', comid = comids)
names(lc) <- toupper(names(lc))

# calculate %Developed and %Forest
lc <- mutate(lc,
             PCTURB2019WS = PCTURBLO2019WS + PCTURBMD2019WS + PCTURBHI2019WS,
             PCTFOR2019WS = PCTCONIF2019WS + PCTMXFST2019WS + PCTDECID2019WS)

df.lc <- left_join(df, lc, by = 'COMID') %>%
  as.data.frame()
# there were 88 rows in df but only 85 in lc
# it turns out StreamCatTools does not return results for Josephine (NLA), returns the same
# data (same COMID) for Pipe and Lucerne lakes, and Lake Armstrong was sampled in NLA and
# by IWS. I'll remove Josephine and one of the Armstrong Lake rows
# df.lc <- filter(df.lc, !lake %in% c('Josephine Lake','Armstrong'))
df.lc <- filter(df.lc, !lake %in% c('Josephine'))
# sort(names(df.lc))

#####################################################
### maybe this is where I add physical lake data to df?
### let's just start with maximum lake depth since we know
### this is one of the most important explanatory variables
# iws.phys <- filter(df.lc,type=="IWS")
# iws.phys <- filter(df.lc,type=="IWS"&lake=="Honeymoon")
# mapview(iws.phys, zcol = 'PCTFOR2019WS', xcol = "lon", ycol = "lat", crs = 4269, grid = FALSE, layer.name = '%forest') + mapview(pugetl, alpha.regions = 0.2)
# nla.phys <- filter(df.lc,type=="NLA")
# nla.phys <- filter(df.lc,type=="NLA"&lake=="Lewis")
# mapview(nla.phys, zcol = 'PCTFOR2019WS', xcol = "lon", ycol = "lat", crs = 4269, grid = FALSE, layer.name = '%forest') + mapview(pugetl, alpha.regions = 0.2)
iwsphys <- read.csv('data/iws_phys.csv') %>%
  select(lake, type, Depth_max_ft)
nlaphys <- read.csv('data/nla_phys.csv') %>%
  select(lake, type, Depth_max_ft)
snophys <- sno_locs %>%
  transmute(lake = lake, type = 'Snohomish', Depth_max_ft = max_depth_m*3.28084)
kcphys <- read_csv('kcsites.csv') %>%
  transmute(lake = Lake, type = "King Co", Depth_max_ft = MaximumDepth)
lkphys <- bind_rows(iwsphys, nlaphys, snophys, kcphys)
df.lc <- left_join(df.lc, lkphys, by = c("lake","type"))
# tmp <- select(df.lc,lake,type,chla,PCTURB2019WS,Depth_max_ft)
So now I have lake and watershed data (and maximum lake depth) for up to 114 lakes across the Puget Lowland of the Puget Sound basin (King Co. = 42, NLA = 22, IWS = 16, Snohomish Co. = 34). To illustrate the coverage, I’ll provide a quick map of lake watershed %forest cover.
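Again, the map comes from a folded chunk; based on the commented mapview calls in the LakeCat chunk above, it likely looks something like this:

library(mapview)
mapview(df.lc, zcol = 'PCTFOR2019WS', xcol = "lon", ycol = "lat",
        crs = 4269, grid = FALSE, layer.name = '%forest') +
  mapview(pugetl, alpha.regions = 0.2)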
So now we can get on with some exploratory data analysis. I’m going to use boosted regression tree (BRT) modeling to sift through the very large number of potential explanatory variables from LakeCat (496 in total, but some of these are the same metric for different points in time; definitions here). I shortened the list to 44 metrics that seemed to have some potential to explain spatial variation in lake trophic state, from the most obvious like %Developed and %Forest to soil erodibility factors. I also log10 transformed the dependent variables (chlorophyll a, tp, tn, and secchi).
Let’s start with chlorophyll a. The initial BRT model narrowed the number of potentially important variables (relative importance > 6) down to less than a handful, and these definitively do not include %Developed or %Forest (see output below). Instead, the most influential variable by far was maximum lake depth. The next most influential variable was %Herbaceous wetland cover (PCTHBWET2019WS) followed by %Hay cover (PCTHAY2019WS).
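For reference, the relative importance screen can be pulled straight from the fitted gbm.step object (gbm1 in the chunks below); a small sketch using the > 6 cutoff mentioned above:

ri <- summary(gbm1, plotit = FALSE)  # data frame with columns var and rel.inf
keep <- ri$var[ri$rel.inf > 6]       # variables retained for a simplified model
keep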
The cross validation (CV) R2 of this model was 0.31. Not that great, but the model was improved slightly by limiting the explanatory variables to the three with relative importance greater than 6 (following Waite et al. (2019)). That model had a CV R2 of 0.35.
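The CV R2 values quoted here follow the deviance-based calculation that appears in commented form in the code chunks below:

# CV R2 = (mean total deviance - estimated CV deviance) / mean total deviance
paste0('Cross validation (CV) R2 = ',
       round((gbm1$self.statistics$mean.null - gbm1$cv.statistics$deviance.mean)/
               gbm1$self.statistics$mean.null, 2))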
Deeper lakes are associated with lower chlorophyll a. Higher watershed %Herbaceous wetland and %Hay cover are associated with higher lake chlorophyll a. Watershed %Forest or %Developed appear to have no significant influence on summer lake chlorophyll a, which is more or less consistent with the findings of Kammer (2004).
code to perform exploratory data analysis for chlorophyll a
### gbm + dismo - boosted regression tree packages
library(gbm)
library(dismo)

#########################
### start with chlorophyll a
tmp <- filter(df.lc, !is.na(chla))
set.seed(2974)
gbm1 <- gbm.step(data = tmp %>% mutate(chla = log10(chla)),
                 gbm.x = c(12:56), gbm.y = c(5),
                 family = "gaussian", tree.complexity = 5,
                 learning.rate = 0.001, bag.fraction = 0.75,
                 verbose = FALSE, silent = TRUE, plot.main = FALSE)

# https://stackoverflow.com/questions/60488587/boosted-regression-trees-deviance-values
# https://stats.stackexchange.com/questions/408119/which-method-is-correct-for-calculating-total-deviance-explained-for-boosted-reg
# mean(gbm1$residuals^2)           # mean residual difference - here 0.08
# gbm1$cv.statistics$deviance.mean # here 0.138 estimated cv deviance
# gbm1$self.statistics$mean.null   # here 0.149 mean total deviance

# R2
paste0('R2 = ', round((gbm1$self.statistics$mean.null - mean(gbm1$residuals^2))/gbm1$self.statistics$mean.null, 2))
[1] "R2 = 0.66"
code to perform exploratory data analysis for chlorophyll a
# note how much lower the cross validation (CV) R2 is...this model would need a lot more work

## contributions of each predictor
# summary(gbm1,cBars = 10,las=2,cex.names=0.6, plotit = FALSE)
# summary(gbm1,cBars = 10,las=2,cex.names=0.6, plotit = TRUE)
par(mar = c(5, 8, 4, 0) + .1)
summary(gbm1, cBars = 10, las = 2, cex.names = 0.8, plotit = TRUE)
So I think now we could just follow Nora’s approach and look at the correlations of chlorophyll a with the two most important watershed variables and with %Developed and %Forest. There is a statistically significant (p<0.05) positive relationship with watershed %Herbaceous wetland cover, but clearly no correlation with watershed %Forest or %Developed cover.
code to create scatter plots for chlorophyll a
suppressWarnings(library(patchwork))
suppressWarnings(library(plotly))
suppressWarnings(library(ggpubr))

cor = with(df.lc, cor.test(PCTHBWET2019WS, log10(chla)))
p1 <- ggplotly(ggplot(df.lc, aes(PCTHBWET2019WS, log10(chla), color = type, text = lake)) +
                 theme_bw() +
                 geom_point() +
                 stat_cor(method = 'pearson') +
                 theme(legend.position = 'none') +
                 annotate('text', x = quantile(df.lc$PCTHBWET2019WS[!is.na(df.lc$PCTHBWET2019WS)], 0.95), y = 2.0,
                          label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 4), sep = " ")) +
                 ggtitle('Chlorophyll a vs %Herbaceous Wetland and %Hay'))

# cor = with(df.lc,cor.test(PCTOW2019WS,log10(chla)))
# p2 <- ggplotly(ggplot(df.lc,aes(PCTOW2019WS,log10(chla),color=type,text=lake)) +
#   theme_bw() +
#   annotate('text',x = quantile(df.lc$PCTOW2019WS[!is.na(df.lc$PCTOW2019WS)],0.9), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,2),sep=" ")) +
#   geom_point())

cor = with(df.lc, cor.test(PCTHAY2019WS, log10(chla)))
p2 <- ggplotly(ggplot(df.lc, aes(PCTHAY2019WS, log10(chla), color = type, text = lake)) +
                 theme_bw() +
                 annotate('text', x = quantile(df.lc$PCTHAY2019WS[!is.na(df.lc$PCTHAY2019WS)], 0.97), y = 2.0,
                          label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 2), sep = " ")) +
                 geom_point())

subplot(p1, p2, margin = 0.1, shareY = T)
code to create scatter plots for chlorophyll a
# cor = with(df.lc,cor.test(PCTHAY2019WS,log10(chla)))
# p1 <- ggplotly(ggplot(df.lc,aes(PCTHAY2019WS,log10(chla),color=type,text=lake)) +
#   theme_bw() +
#   geom_point() +
#   annotate('text',x = quantile(df.lc$PCTHAY2019WS[!is.na(df.lc$PCTHAY2019WS)],0.9), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,2),sep=" ")) +
#   theme(legend.position = 'none') +
#   ggtitle('Chlorophyll a vs %Hay and Hydrologic Conductivity'))
# cor = with(df.lc,cor.test(HYDRLCONDWS,log10(chla)))
# p2 <- ggplotly(ggplot(df.lc,aes(HYDRLCONDWS,log10(chla),color=type,text=lake)) +
#   theme_bw() +
#   annotate('text',x = quantile(df.lc$HYDRLCONDWS[!is.na(df.lc$HYDRLCONDWS)],0.9), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,2),sep=" ")) +
#   geom_point())
# subplot(p1,p2,margin=0.1,shareY = T)

cor = with(df.lc, cor.test(PCTURB2019WS, log10(chla)))
p1 <- ggplotly(ggplot(df.lc, aes(PCTURB2019WS, log10(chla), color = type, text = lake)) +
                 theme_bw() +
                 geom_point() +
                 theme(legend.position = 'none') +
                 annotate('text', x = quantile(df.lc$PCTURB2019WS[!is.na(df.lc$PCTURB2019WS)], 0.75), y = 2.0,
                          label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 2), sep = " ")) +
                 ggtitle('Chlorophyll a vs %Developed and %Forest'))

cor = with(df.lc, cor.test(PCTFOR2019WS, log10(chla)))
p2 <- ggplotly(ggplot(df.lc, aes(PCTFOR2019WS, log10(chla), color = type, text = lake)) +
                 theme_bw() +
                 annotate('text', x = quantile(df.lc$PCTFOR2019WS[!is.na(df.lc$PCTFOR2019WS)], 0.65), y = 2.0,
                          label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 2), sep = " ")) +
                 geom_point())

subplot(p1, p2, margin = 0.1, shareY = T)
Now we can take a look at total phosphorus. The BRT model again identified maximum depth as the most important variable, with a negative relationship with lake total phosphorus. The next two variables, %Hay cover and %Soil organic matter (OMWS), were relatively unimportant. The CV R2 of this model was 0.48. Simplification of this model did not result in an improved CV R2.
The scatter plots indicate a statistically significant (p<0.05) negative relationship with watershed %Soil organic matter, but clearly no correlation with watershed %Forest or %Developed cover.
code to perform exploratory data analysis for total phosphorus
### gbm + dismo - boosted regression tree packages
library(gbm)
library(dismo)

#########################
### total phosphorus
tmp <- filter(df.lc, !is.na(tp))
set.seed(2974)
gbm1 <- gbm.step(data = df.lc %>% mutate(tp = log10(tp)),
                 gbm.x = c(12:56), gbm.y = c(10),
                 family = "gaussian", tree.complexity = 5,
                 learning.rate = 0.001, bag.fraction = 0.75,
                 verbose = FALSE, silent = TRUE, plot.main = FALSE)

# https://stackoverflow.com/questions/60488587/boosted-regression-trees-deviance-values
# https://stats.stackexchange.com/questions/408119/which-method-is-correct-for-calculating-total-deviance-explained-for-boosted-reg
# mean(gbm1$residuals^2)           # mean residual difference - here 0.08
# gbm1$cv.statistics$deviance.mean # here 0.138 estimated cv deviance
# gbm1$self.statistics$mean.null   # here 0.149 mean total deviance

# R2
paste0('R2 = ', round((gbm1$self.statistics$mean.null - mean(gbm1$residuals^2))/gbm1$self.statistics$mean.null, 2))
[1] "R2 = 0.8"
code to perform exploratory data analysis for total phosphorus
# note how much lower the cross validation (CV) R2 is...this model would need a lot more work

## contributions of each predictor
# summary(gbm1,cBars = 10,las=2,cex.names=0.6, plotit = FALSE)
par(mar = c(5, 8, 4, 0) + .1)
summary(gbm1, cBars = 10, las = 2, cex.names = 0.8, plotit = TRUE)
code to perform exploratory data analysis for total phosphorus
# library(scales)
# source('gbm.plot.kc.R')
# gbm.plot.kc(gbm1, n.plots=3, write.title = F,
#             plot.layout=c(3, 1),
#             common.scale = TRUE,
#             # smooth = TRUE,
#             show.contrib = TRUE)

# set.seed(2974)
# gbm1 <- gbm.step(data = df.lc %>% mutate(tp = log10(tp)), gbm.x = c(34,36,40,44), gbm.y=c(10),
# gbm1 <- gbm.step(data = df.lc %>% mutate(tp = log10(tp)), gbm.x = c(33,57),
#                  gbm.y=c(10), family="gaussian", tree.complexity = 5,
#                  learning.rate=0.001, bag.fraction=0.75,
#                  verbose = FALSE, silent = TRUE, plot.main = FALSE)
# # R2
# paste0('R2 = ', round((gbm1$self.statistics$mean.null - mean(gbm1$residuals^2) )/gbm1$self.statistics$mean.null,2))
# # CV R2
# paste0('Cross validation (CV) R2 = ', round((gbm1$self.statistics$mean.null - gbm1$cv.statistics$deviance.mean)/gbm1$self.statistics$mean.null,2))
# # note how much lower the cross validation (CV) R2 is...this model would need a lot more work
# ## contributions of each predictor
# summary(gbm1,cBars = 10,las=2,cex.names=0.6, plotit = FALSE)

cor = with(df.lc, cor.test(PCTHAY2019WS, log10(tp)))
p1 <- ggplotly(ggplot(df.lc, aes(PCTHAY2019WS, log10(tp), color = type, text = lake)) +
                 theme_bw() +
                 annotate('text', x = quantile(df.lc$PCTHAY2019WS[!is.na(df.lc$PCTHAY2019WS)], 0.97), y = 2.0,
                          label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 2), sep = " ")) +
                 geom_point() +
                 ggtitle('Total Phosphorus vs %Hay and %Soil OM'))

cor = with(df.lc, cor.test(OMWS, log10(tp)))
p2 <- ggplotly(ggplot(df.lc, aes(OMWS, log10(tp), color = type, text = lake)) +
                 theme_bw() +
                 annotate('text', x = quantile(df.lc$OMWS[!is.na(df.lc$OMWS)], 0.9), y = 2.0,
                          label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 4), sep = " ")) +
                 geom_point())

# p1 <- ggplotly(ggplot(df.lc,aes(PCTHBWET2019WS,log10(tp),color=type,text=lake)) +
#   theme_bw() +
#   geom_point() +
#   stat_cor(method='pearson') +
#   theme(legend.position = 'none') +
#   annotate('text',x = quantile(df.lc$PCTHBWET2019WS[!is.na(df.lc$PCTHBWET2019WS)],0.95), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,4),sep=" ")) +
#   ggtitle('Total Phosphorus vs %Herbaceous Wetland and %Hay'))

subplot(p1, p2, margin = 0.1, shareY = T)
code to perform exploratory data analysis for total phosphorus
cor = with(df.lc, cor.test(PCTURB2019WS, log10(tp)))
p1 <- ggplotly(ggplot(df.lc, aes(PCTURB2019WS, log10(tp), color = type, text = lake)) +
                 theme_bw() +
                 geom_point() +
                 theme(legend.position = 'none') +
                 annotate('text', x = quantile(df.lc$PCTURB2019WS[!is.na(df.lc$PCTURB2019WS)], 0.75), y = 2.0,
                          label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 2), sep = " ")) +
                 ggtitle('Total Phosphorus vs %Developed and %Forest'))

cor = with(df.lc, cor.test(PCTFOR2019WS, log10(tp)))
p2 <- ggplotly(ggplot(df.lc, aes(PCTFOR2019WS, log10(tp), color = type, text = lake)) +
                 theme_bw() +
                 annotate('text', x = quantile(df.lc$PCTFOR2019WS[!is.na(df.lc$PCTFOR2019WS)], 0.7), y = 2.0,
                          label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 2), sep = " ")) +
                 geom_point())

subplot(p1, p2, margin = 0.1, shareY = T)
Now we can take a look at Secchi depth. The top explanatory variable was maximum lake depth. The next two variables with relative importance greater than 6 were the %Herbaceous and %Woody wetland cover metrics. The CV R2 of this model was 0.43. Simplification of this model did not result in an improved CV R2.
The scatter plots show that there is a statistically significant (p<0.05) negative relationship with watershed %Herbaceous and %Woody wetland cover, but clearly no correlation with watershed %Forest or %Developed cover.
code to perform exploratory data analysis for secchi depth
### gbm + dismo - boosted regression tree packages
library(gbm)
library(dismo)

#########################
### secchi depth
tmp <- filter(df.lc, !is.na(secchi))
set.seed(2974)
gbm1 <- gbm.step(data = df.lc %>% mutate(secchi = log10(secchi)) %>% filter(!is.na(secchi)),
                 gbm.x = c(12:56), gbm.y = c(7),
                 family = "gaussian", tree.complexity = 5,
                 learning.rate = 0.001, bag.fraction = 0.75,
                 verbose = FALSE, silent = TRUE, plot.main = FALSE)

# https://stackoverflow.com/questions/60488587/boosted-regression-trees-deviance-values
# https://stats.stackexchange.com/questions/408119/which-method-is-correct-for-calculating-total-deviance-explained-for-boosted-reg
# mean(gbm1$residuals^2)           # mean residual difference - here 0.08
# gbm1$cv.statistics$deviance.mean # here 0.138 estimated cv deviance
# gbm1$self.statistics$mean.null   # here 0.149 mean total deviance

# R2
paste0('R2 = ', round((gbm1$self.statistics$mean.null - mean(gbm1$residuals^2))/gbm1$self.statistics$mean.null, 2))
[1] "R2 = 0.79"
code to perform exploratory data analysis for secchi depth
# note how much lower the cross validation (CV) R2 is...this model would need a lot more work

## contributions of each predictor
# summary(gbm1,cBars = 10,las=2,cex.names=0.6, plotit = FALSE)
par(mar = c(5, 8, 4, 0) + .1)
summary(gbm1, cBars = 10, las = 2, cex.names = 0.8, plotit = TRUE)
code to perform exploratory data analysis for secchi depth
# cor = with(df.lc,cor.test(PCTOW2019WS,log10(secchi)))
# p1 <- ggplotly(ggplot(df.lc,aes(PCTOW2019WS,log10(secchi),color=type,text=lake)) +
#   theme_bw() +
#   geom_point() +
#   annotate('text',x = quantile(df.lc$PCTOW2019WS[!is.na(df.lc$PCTOW2019WS)],0.9), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,4),sep=" ")) +
#   theme(legend.position = 'none') +
#   # ggtitle('secchi vs %Open Water and Baseflow Index'))
#   ggtitle('secchi vs %Open Water'))
# cor = with(df.lc,cor.test(BFIWS,log10(secchi)))
# p2 <- ggplotly(ggplot(df.lc,aes(BFIWS,log10(secchi),color=type,text=lake)) +
#   theme_bw() +
#   annotate('text',x = quantile(df.lc$BFIWS[!is.na(df.lc$BFIWS)],0.5), y = 2.0, label = paste("R =",round(cor$estimate,3)," p =",round(cor$p.value,2),sep=" ")) +
#   geom_point())
# subplot(p1,p2,margin=0.1,shareY = T)
# p1

cor = with(df.lc, cor.test(PCTURB2019WS, log10(secchi)))
p1 <- ggplotly(ggplot(df.lc, aes(PCTURB2019WS, log10(secchi), color = type, text = lake)) +
                 theme_bw() +
                 geom_point() +
                 theme(legend.position = 'none') +
                 annotate('text', x = quantile(df.lc$PCTURB2019WS[!is.na(df.lc$PCTURB2019WS)], 0.75), y = 2.0,
                          label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 2), sep = " ")) +
                 ggtitle('secchi vs %Developed and %Forest'))

cor = with(df.lc, cor.test(PCTFOR2019WS, log10(secchi)))
p2 <- ggplotly(ggplot(df.lc, aes(PCTFOR2019WS, log10(secchi), color = type, text = lake)) +
                 theme_bw() +
                 annotate('text', x = quantile(df.lc$PCTFOR2019WS[!is.na(df.lc$PCTFOR2019WS)], 0.75), y = 2.0,
                          label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 2), sep = " ")) +
                 geom_point())

subplot(p1, p2, margin = 0.1, shareY = T)
Even the watershed wetland variables explain very little of the variance in lake trophic state across the Puget Lowland of Puget Sound. Is lake trophic state more a product of a legacy of change from forest clearing through agricultural development to urbanization? Would sewer vs septic explain more of the variance? Nora also found that hydraulic retention time was correlated with lake trophic state. Would hydraulic retention time explain as much of the variance as maximum lake depth, or more?
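For reference, hydraulic retention time is just lake volume divided by the outflow (or inflow) rate, so a back-of-the-envelope estimate needs only two numbers; a sketch with hypothetical inputs:

# hypothetical inputs: volume in m^3 and mean annual outflow in m^3/yr
retention_time_yr <- function(volume_m3, outflow_m3_yr) volume_m3/outflow_m3_yr
retention_time_yr(5e6, 2.5e7)  # ~0.2 yr for this made-up lake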
Some suggested reading:
Kammer (2004). Lake Water Quality and Health: Land Use Considerations for Small Lake Watersheds in King County, Washington. King County, WA.
Politi et al. (2024). A global typological approach to classify lakes based on their eutrophication risk. Aquatic Sciences 86:52.
Read et al. (2015). The importance of lake-specific characteristics for water quality across the continental United States. Ecological Applications 25(4):943-955.
Sabo et al. (2023). Comparing drivers of spatial variability in U.S. lake and stream phosphorus concentrations. J. Geophys. Res. Biogeosci. 128(8).
Addendum - Quick look at relationship between lake trophic state and septic system density
To do this I used the latest septic system coverage for King County and looked at the correlation of lake watershed septic system density (OSSs per hectare: OSS_HA) with the lake trophic state indicators as determined above. One of several possible limitations is that I do not currently have individual watershed delineations for every King County lake, and some lake watersheds extend into Snohomish County (Snohomish County provided septic system data!). I had to limit this analysis to King County lakes because septic data for the entire Puget Lowland of Puget Sound is unlikely to ever become available, although it might be possible in the future to do as Moore et al. (2003) did and contact each local jurisdiction for an indication of whether a lake should be classified as served by sewers or septic systems, or considered undeveloped.
I also tried to look at the sum of the inverse distance of septic systems to the lake perimeter, but this didn’t seem to provide any improvement in the relationships with lake trophic state.
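For the record, that metric was simply the sum of inverse distances from each OSS point to the lake shoreline; a minimal sf sketch (the oss_points and lake_poly names are hypothetical, and a projected CRS is assumed so that distances are in meters):

library(sf)
d <- st_distance(oss_points, st_boundary(lake_poly))  # point-to-perimeter distances
oss_invdist <- sum(1/as.numeric(d))                   # larger = more and/or closer OSS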
One next step would be to collect watershed delineations for Snohomish lakes and the missing King Co. lake delineations to increase the size of the dataset available for analysis.
Although there seems to be a hint of the expected relationship (increasing septic system density associated with higher lake trophic state), that hypothesis is not strongly supported by the data. The greatest support appears for summer lake nitrogen concentration, which is consistent with nitrate being the most mobile nutrient released from working septic systems.
Note, septic systems do not need to “fail” to release nutrients to local soils and groundwater. That is what working septic systems are designed to do - break down organic matter into water, carbon dioxide, and soluble nutrients. A “failing” septic system is one that releases untreated organic waste to soils or water (surface or ground).
code to perform exploratory data analysis for lake watershed septic density
### the operation to load the King Co septic data is too time consuming to include here
### so I will show the work but read the processed data
# oss_poly <- st_read("//gisdw/KCLIB/Plibrary2/utility/shapes/polygon/septic_onsite_parcel.shp")
# oss_points <- st_centroid(oss_poly) %>% filter(ExpExclude == "OSS") %>%
#   st_transform(4269)
# lake_ws <- st_read("C:/Users/degaspec/OneDrive - King County/gis_external/small_lakes/watersheds_all.shp") %>%
#   st_transform(4269)
# unique(oss_points$ExpExclude)
# sewer_line <- st_read("//gisdw/KCLIB/Plibrary2/utility/shapes/arc/sewer.shp") %>%
#   st_transform(4269)
# sewer_serv <- st_read("//gisdw/KCLIB/Plibrary2/utility/shapes/polygon/sewer_serv.shp") %>%
#   st_transform(4269)
# lake_ws_oss <- st_filter(oss_points,lake_ws)
# lake_ws_oss.i <- st_intersection(oss_points,lake_ws)
# there is a different result depending on the use of filter or intersection. Choosing the intersection result for now.
# mapview(lake_ws_oss, size = 0.5) + mapview(lake_ws, alpha.regions = 0.2) + mapview(sewer, alpha.regions = 0.5, color = 'pink') +
#   mapview(sewer_serv, alpha.regions = 0.2, col.regions = 'pink')
# mapview(lake_ws_oss.i, size = 0.5) + mapview(lake_ws, alpha.regions = 0.2) + mapview(sewer, alpha.regions = 0.5, color = 'pink') +
#   mapview(sewer_serv, alpha.regions = 0.2, col.regions = 'pink')
# lake_ws_oss_dens <- st_drop_geometry(lake_ws_oss.i) %>% data.frame()
# lake_ws_oss_dens <- lake_ws_oss_dens %>% group_by(SOURCETHM) %>%
#   summarize(OSS_HA = n()/HECTARES, n = n(), HECTARES = HECTARES) %>%
#   ungroup() %>% distinct()
# unique(lake_ws_oss.i$SOURCETHM)
# lake_ws_oss_dens <- mutate(lake_ws_oss_dens,lake = str_replace(SOURCETHM,"_s.*",""))
# saveRDS(lake_ws_oss_dens,'lake_ws_oss_dens.rds')

lake_ws_oss_dens <- readRDS('lake_ws_oss_dens.rds')
kc.lakes <- dplyr::filter(df, type == 'King Co')

lake_ws_oss_dens <- mutate(lake_ws_oss_dens,
                           lake = ifelse(lake == "Echo", "Echo-Shoreline",
                                  ifelse(lake == 'Beaver1', "Beaver-1",
                                  ifelse(lake == 'Beaver2', 'Beaver-2',
                                  ifelse(lake == 'Neilson', "Neilson (Holm)",
                                  ifelse(lake == 'Mcdonald', 'McDonald', lake))))))
tmp <- data.frame(OSS_HA = 0, n = 0, lake = "Green-1")
lake_ws_oss_dens <- bind_rows(lake_ws_oss_dens, tmp)

# kcsites <- read_csv('kcsites.csv') %>% transmute(lake=Lake, Locator=SITE_ID, lat = LAT_DD83, lon = LON_DD83)
kcsites <- read_csv('kcsites.csv') %>%
  transmute(lake = Lake, Locator = SITE_ID, depth_max_ft = depth_max_ft)
# kcsites$Sewer_status <- fct_collapse(kcsites$Sewer_status, Sewered = c("Sewered","Fringe"))
tmp2 <- left_join(kc.lakes, kcsites, by = "lake")
# tmp2 <- left_join(kc.lakes,lake_ws_oss_dens,by='lake')
tmp2 <- left_join(tmp2, lake_ws_oss_dens, by = 'lake')

# note: the correlations are computed on log10 values while the y axes show natural
# log values; Pearson's R and its p value are the same for either log base
cor = with(tmp2, cor.test(OSS_HA, log10(chla)))
p1 <- ggplot(tmp2, aes(OSS_HA, log(chla))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text', x = quantile(tmp2$OSS_HA, 0.9, na.rm = T), y = 2.0,
           label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 5), sep = " ")) +
  ggtitle('Chlorophyll a vs OSS density (#/ha)')

cor = with(tmp2, cor.test(OSS_HA, log10(tp)))
p2 <- ggplot(tmp2, aes(OSS_HA, log(tp))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text', x = quantile(tmp2$OSS_HA, 0.9, na.rm = T), y = 2.0,
           label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 5), sep = " ")) +
  ggtitle('Total Phosphorus vs OSS density (#/ha)')

cor = with(tmp2, cor.test(OSS_HA, log10(tn)))
p3 <- ggplot(tmp2, aes(OSS_HA, log(tn))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text', x = quantile(tmp2$OSS_HA, 0.9, na.rm = T), y = 6.0,
           label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 5), sep = " ")) +
  ggtitle('Total Nitrogen vs OSS density (#/ha)')

cor = with(tmp2, cor.test(OSS_HA, log10(secchi)))
p4 <- ggplot(tmp2, aes(OSS_HA, log(secchi))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text', x = quantile(tmp2$OSS_HA, 0.9, na.rm = T), y = 1.0,
           label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 5), sep = " ")) +
  ggtitle('secchi vs OSS density (#/ha)')

(p1+p2)/(p3+p4)
code to perform exploratory data analysis for lake watershed septic density
I also tried placing each lake watershed into three categories: wholly within the regional sewer system, straddling the boundary (fringe), and wholly outside of the regional sewer system. The differences were consistent with the conclusions of Moore et al. (2003).
Multiple comparisons indicated lower trophic state for lakes both within and outside of the regional sewer service area, and the highest trophic state for lakes on the fringe. Lakes within the service area would have regional sewer service and fewer septic systems, and lakes outside of the service area would be the least developed and have the fewest septic systems. Lakes on the fringe would have relatively high development and the largest number of septic systems. However, none of the apparent differences were statistically significant.
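The compact letter display shown further below (all three groups share the letter “a”) can be generated along these lines; a sketch using pairwise Wilcoxon tests and rcompanion::cldList, with chlorophyll a as the example, which may differ from the folded code that produced the actual table:

library(rcompanion)
pw <- pairwise.wilcox.test(tmp2$chla, tmp2$Sewer_status, p.adjust.method = "BH")
pm <- na.omit(as.data.frame(as.table(pw$p.value)))  # long format: Var1, Var2, Freq (= p)
pm$Comparison <- paste(pm$Var1, pm$Var2, sep = " - ")
cldList(Freq ~ Comparison, data = pm, threshold = 0.05)  # groups sharing a letter do not differ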
I also checked to see if there were any systematic differences in lake depth in each of the categories. It looks like lake depths were similarly distributed within each category. Lake depth seems like a variable that needs to be considered in any statistical evaluation of the spatial variability of trophic state.
code to perform exploratory data analysis for lake watershed septic density - alt
sample_size <- tmp2 %>%
  group_by(Sewer_status) %>%
  summarize(num = n())
tmp2 <- left_join(tmp2, sample_size, by = "Sewer_status") %>%
  mutate(Sewer_status = paste0(Sewer_status, "\n", "n=", num)) %>%
  # hard coded converting new "Sewer_status = n" to ordered factor
  mutate(Sewer_status = factor(Sewer_status, ordered = T,
                               levels = c("Sewered\nn=18","Fringe\nn=8","Unsewered\nn=16")))

res.kruskal <- coin::kruskal_test(chla ~ Sewer_status, data = tmp2)
res.kruskal
Asymptotic Linear-by-Linear Association Test
data: chla by
Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = 1.0639, p-value = 0.2874
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
ggplot(tmp2, aes(x = Sewer_status, y = chla)) +
  theme_bw() +
  geom_boxplot() +
  labs(x = "", y = "Chlorophyll a (µg/L)") +
  ggtitle("Regional sewer status and Chlorophyll a") +
  stat_compare_means()
code to perform exploratory data analysis for lake watershed septic density - alt
res.kruskal <- coin::kruskal_test(tp ~ Sewer_status, data = tmp2)
res.kruskal
Asymptotic Linear-by-Linear Association Test
data: tp by
Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = -0.22399, p-value = 0.8228
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
ggplot(tmp2, aes(x = Sewer_status, y = tp)) +
  theme_bw() +
  geom_boxplot() +
  labs(x = "", y = "Total Phosphorus (µg/L)") +
  ggtitle("Regional sewer status and Total Phosphorus") +
  stat_compare_means()
code to perform exploratory data analysis for lake watershed septic density - alt
res.kruskal <- coin::kruskal_test(tn ~ Sewer_status, data = tmp2)
res.kruskal
Asymptotic Linear-by-Linear Association Test
data: tn by
Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = 0.88195, p-value = 0.3778
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
ggplot(tmp2, aes(x = Sewer_status, y = tn)) +
  theme_bw() +
  geom_boxplot() +
  labs(x = "", y = "Total Nitrogen (µg/L)") +
  ggtitle("Regional sewer status and Total Nitrogen") +
  stat_compare_means()
code to perform exploratory data analysis for lake watershed septic density - alt
res.kruskal <- coin::kruskal_test(secchi ~ Sewer_status, data = tmp2)
res.kruskal
Asymptotic Linear-by-Linear Association Test
data: secchi by
Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = -1.0219, p-value = 0.3068
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
ggplot(tmp2, aes(x = Sewer_status, y = secchi)) +
  theme_bw() +
  geom_boxplot() +
  labs(x = "", y = "Secchi transparency (m)") +
  ggtitle("Regional sewer status and Secchi transparency") +
  stat_compare_means()
code to perform exploratory data analysis for lake watershed septic density - alt
ggplot(tmp2, aes(x = Sewer_status, y = depth_max_ft)) +
  theme_bw() +
  geom_boxplot() +
  labs(x = "", y = "Maximum depth (ft)") +
  ggtitle("Regional sewer status and lake depth") +
  stat_compare_means()
code to perform exploratory data analysis for lake watershed septic density - alt
Asymptotic General Independence Test
data: chla by
Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = 0.90362, p-value = 0.3662
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
coin::independence_test(data = tmp2, tp ~ Sewer_status)
Asymptotic General Independence Test
data: tp by
Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = -0.17386, p-value = 0.862
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
coin::independence_test(data = tmp2, tn ~ Sewer_status)
Asymptotic General Independence Test
data: tn by
Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = -0.051213, p-value = 0.9592
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
Asymptotic General Independence Test
data: secchi by
Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = -1.3118, p-value = 0.1896
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
Asymptotic General Independence Test
data: depth_max_ft by
Sewer_status (Sewered
n=18 < Fringe
n=8 < Unsewered
n=16)
Z = -0.43215, p-value = 0.6656
alternative hypothesis: two.sided
code to perform exploratory data analysis for lake watershed septic density - alt
Group Letter MonoLetter
1 Sewered\nn18 a a
2 Fringe\nn8 a a
3 Unsewered\nn16 a a
code to perform exploratory data analysis for lake watershed septic density - alt
# ancova to control for lake depth in multiple comparisons
# https://www.datanovia.com/en/lessons/ancova-in-r/
library(rstatix)  # anova_test() and get_anova_table()
res.aov <- tmp2 %>%
  anova_test(secchi ~ depth_max_ft + Sewer_status)
get_anova_table(res.aov)
ANOVA Table (type II tests)
Effect DFn DFd F p p<.05 ges
1 depth_max_ft 1 36 22.935 2.86e-05 * 0.389
2 Sewer_status 2 36 1.264 2.95e-01 0.066
code to perform exploratory data analysis for lake watershed septic density - alt
Anova Table
Resampling test using freedman_lane to handle nuisance variables and 1e+05 permutations.
SS df F parametric P(>F) resampled P(>F)
Sewer_status 13.06 2 2.346 0.1091 0.108
Residuals 108.52 39
code to perform exploratory data analysis for lake watershed septic density - alt
library(permuco)  # aovperm()
res.pavo <- aovperm(secchi ~ depth_max_ft*Sewer_status, data = tmp2, np = 100000)
res.pavo
Anova Table
Resampling test using freedman_lane to handle nuisance variables and 1e+05 permutations.
SS df F parametric P(>F) resampled P(>F)
depth_max_ft 15.800 1 8.9246 0.005192 0.00395
Sewer_status 0.686 2 0.1937 0.824778 0.82694
depth_max_ft:Sewer_status 3.145 2 0.8883 0.420700 0.42542
Residuals 60.192 34
Addendum - Quick look at relationship between lake trophic state and maximum lake depth
After reviewing one of the suggested readings above (Read et al. 2015), I found that lake depth might explain most of the variability in observed lake trophic state. Here I just use the King County lake data. Fortunately, I found a source of King County (and Snohomish Co) lake physical data that I suspect has been used in the past to populate webpage and report tables - Bortleson et al. (1976). The Bortleson report also contains a treasure trove of lake water quality observations that might also be of some use.
code to perform exploratory data analysis for maximum lake depth
#### let's look @ some lake physical characteristics
lkphys <- read_csv("C:/Users/degaspec/OneDrive - King County/R/NLA/kcsites.csv") %>%
  rename(lake = Lake)
tmp2 <- left_join(tmp2, lkphys, by = 'lake')

cor = with(tmp2, cor.test(MaximumDepth, log10(chla)))
p1 <- ggplot(tmp2, aes(MaximumDepth, log(chla))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text', x = quantile(tmp2$MaximumDepth, 0.9), y = 2.0,
           label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 5), sep = " ")) +
  ggtitle('Chlorophyll a vs Maximum Lake Depth')

cor = with(tmp2, cor.test(MaximumDepth, log10(tp)))
p2 <- ggplot(tmp2, aes(MaximumDepth, log(tp))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text', x = quantile(tmp2$MaximumDepth, 0.9), y = 3.5,
           label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 5), sep = " ")) +
  ggtitle('Total Phosphorus vs Maximum Lake Depth')

cor = with(tmp2, cor.test(MaximumDepth, log10(tn)))
p3 <- ggplot(tmp2, aes(MaximumDepth, log(tn))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text', x = quantile(tmp2$MaximumDepth, 0.9), y = 3.5,
           label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 5), sep = " ")) +
  ggtitle('Total Nitrogen vs Maximum Lake Depth')

cor = with(tmp2, cor.test(MaximumDepth, log10(secchi)))
p4 <- ggplot(tmp2, aes(MaximumDepth, log(secchi))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text', x = quantile(tmp2$MaximumDepth, 0.9), y = 2.25,
           label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 5), sep = " ")) +
  ggtitle('secchi Depth vs Maximum Lake Depth')

(p1+p2)/(p3+p4)
I requested and received some maximum lake depth data for the National Lakes Assessment lakes (actually index site sample depth, but similar enough I think, since the index site target is the deepest part of each lake). I find a similar correlation between index site sample depth and trophic state indicators.
code to perform exploratory data analysis for NLA index site depth
#### let's look @ some NLA lake physical characteristics
nla.lkphys <- read_csv("C:/Users/degaspec/OneDrive - King County/R/NLA/data/NLA_WA_2007-2022_lakes_depth_area.csv")

# The NLA data files do not include lake names so I had to first filter the national
# data to lakes in the Puget Lowland and then identify lake names using Google Maps
nlasites <- read_csv('sites_lake_names.csv') %>%
  rename(lake = Lake)
nla2 <- left_join(nlasites, nla.lkphys, by = "SITE_ID") %>%
  mutate(INDEX_SITE_DEPTH = as.numeric(INDEX_SITE_DEPTH)) %>%
  group_by(lake) %>%
  summarize(AREA_HA = mean(AREA_HA, na.rm = TRUE),
            PERIM_KM = mean(PERIM_KM, na.rm = TRUE),
            MaximumDepth = mean(INDEX_SITE_DEPTH, na.rm = TRUE),
            WSAREASQKM = mean(WSAREASQKM, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(Type = "NLA")

# tmp3 <- left_join(nla,df,by='lake')
tmp3 <- left_join(nla2, nla, by = 'lake')

cor = with(tmp3, cor.test(MaximumDepth, log10(chla)))
p1 <- ggplot(tmp3, aes(MaximumDepth, log(chla))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text', x = quantile(tmp3$MaximumDepth, 0.9), y = 2.0,
           label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 5), sep = " ")) +
  ggtitle('Chlorophyll a vs Index Site Depth')

cor = with(tmp3, cor.test(MaximumDepth, log10(tp)))
p2 <- ggplot(tmp3, aes(MaximumDepth, log(tp))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text', x = quantile(tmp3$MaximumDepth, 0.9), y = 3.5,
           label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 7), sep = " ")) +
  ggtitle('Total Phosphorus vs Index Site Depth')

cor = with(tmp3, cor.test(MaximumDepth, log10(tn)))
p3 <- ggplot(tmp3, aes(MaximumDepth, log(tn))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text', x = quantile(tmp3$MaximumDepth, 0.9), y = 3.5,
           label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 5), sep = " ")) +
  ggtitle('Total Nitrogen vs Index Site Depth')

cor = with(tmp3, cor.test(MaximumDepth, log10(secchi)))
p4 <- ggplot(tmp3, aes(MaximumDepth, log(secchi))) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text', x = quantile(tmp3$MaximumDepth, 0.9), y = 2.25,
           label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 5), sep = " ")) +
  ggtitle('secchi Depth vs Index Site Depth')

(p1+p2)/(p3+p4)
I combined the data to see what that might look like… Similar relationships with somewhat different slopes, with the possible exception of chlorophyll a. It also looks like the NLA lakes include lakes that are shallower than some King County lakes, but a number of King County lakes are deeper than any of the NLA lakes in the Puget Lowland of Puget Sound.
code to perform exploratory data analysis for combined lake depth data
tmp4 <- bind_rows(tmp2, tmp3)

cor = with(tmp4, cor.test(MaximumDepth, log10(chla)))
p1 <- ggplot(tmp4, aes(MaximumDepth, log(chla), color = Type)) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text', x = quantile(tmp4$MaximumDepth, 0.9), y = 2.0,
           label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 5), sep = " ")) +
  ggtitle('Chlorophyll a vs Index Site Depth')

cor = with(tmp4, cor.test(MaximumDepth, log10(tp)))
p2 <- ggplot(tmp4, aes(MaximumDepth, log(tp), color = Type)) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text', x = quantile(tmp4$MaximumDepth, 0.9), y = 3.5,
           label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 7), sep = " ")) +
  ggtitle('Total Phosphorus vs Index Site Depth')

cor = with(tmp4, cor.test(MaximumDepth, log10(tn)))
p3 <- ggplot(tmp4, aes(MaximumDepth, log(tn), color = Type)) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  theme(legend.position = 'none') +
  annotate('text', x = quantile(tmp4$MaximumDepth, 0.9), y = 3.5,
           label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 5), sep = " ")) +
  ggtitle('Total Nitrogen vs Index Site Depth')

cor = with(tmp4, cor.test(MaximumDepth, log10(secchi)))
p4 <- ggplot(tmp4, aes(MaximumDepth, log(secchi), color = Type)) +
  geom_point() +
  # geom_quantile(formula = y~x, quantiles = 0.9) +
  geom_smooth(formula = y~x, method = 'lm') +
  # theme(legend.position = 'none') +
  annotate('text', x = quantile(tmp4$MaximumDepth, 0.9), y = 2.25,
           label = paste("R =", round(cor$estimate, 3), " p =", round(cor$p.value, 5), sep = " ")) +
  ggtitle('secchi Depth vs Index Site Depth')

(p1+p2)/(p3+p4)
And now a preliminary look at some similar analyses using data collected by the USGS on many, many lakes in the Puget Sound basin in the early 1970s…starting with data for King and Snohomish County lakes.
As I noted above, Bortleson et al. (1976) contains a treasure trove of data, including watershed land use, lake physical characteristics, and water quality (single summer samples from the early 1970s). The water quality data include TP, TN (derived by summing N forms), Secchi depth, and color (PCU), but sadly no chlorophyll a measurements.
It took a few days to enter the data for 362 lakes. I then filtered these lakes to those in the Puget Lowland and excluded lakes that weren’t small lakes (e.g., Lake Union and Portage Bay in King County), were saline, were mostly covered by emergent vegetation (really wetlands), or were reservoirs on a larger river system. This screening resulted in 226 Puget Lowland lakes.
I conducted a separate analysis of these data that can be found here.
Please take a look, but what I found from that analysis was that the variables that best explained summer lake surface TP, in order of importance (most to least), were water color, maximum lake depth, and %Agriculture. The variables that best explained summer lake Secchi depth, in order of importance, were water color and maximum lake depth.