Over 20 years ago, Nora Kammer (2004) reported on analyses of King County Small Lakes data to address three hypotheses:
As the proportion of high density land uses in lake watersheds increase, lakes will become more eutrophic.
As the proportion of forested land uses in lake watersheds increase, lakes will become less eutrophic.
Septic systems within lake watersheds will cause lakes to become more eutrophic.
Although not explicitly stated, Nora’s approach is often described as a space-for-time approach. Nora looked for correlations between independent variables (e.g., watershed %forest) and dependent variables (trophic state indicators chlorophyll a, total phosphorus, and Secchi depth) at specific points in time.
The space-for-time approach uses spatial gradients of independent variables as a proxy for temporal changes in dependent variables. This approach is typically used because data were not collected at one or more individual locations over a period of time covering significant changes in the independent variables of interest.
Nora’s results were quite striking. High density development in a lake’s watershed was negatively correlated with lake trophic state indicators (higher density development ~ lower trophic state). and forest cover was positively correlated with trophic state indicators (higher forest cover ~ higher trophic state). A less counter intuitive finding was that lakes with sewered watersheds (i.e., few or no septic systems) tended to be less eutrophic.
I’ve been reporting elsewhere about a revisit of Nora’s analyses using more recent King County, Snohomish County, Institute of Watershed Studies (primarily Whatcom County), and National Lakes Assessment data. Here I describe a revisit of Nora’s first two questions using Puget Sound basin lake and watershed data collected by the USGS in the 1970s (Bortleson et al. 1976).
The USGS data ranged from watershed land use, lake physical characteristics, to water quality data (single summer surface ~ 1 m samples from the early 1970s). The water quality data included TP, TN (derived by summing N forms), Secchi depth, and Color (PCU), but sadly no chlorophyll a measurements. The physical lake characteristics included lake area, volume, mean and maximum depth, shoreline length, shoreline configuration, development volume, and bottom slope. The watershed data included % shoreline residential development, number of nearshore homes, % agriculture and forest land use, and the % of watershed suburban and urban residential land use. I summed the % of suburban and urban residential land use into a % residential land use since most (%81) of the % residential urban values were zero - 21% of the combined residential values were zero.
Physical lake characteristics shoreline configuration (SC) and development volume (DV) probably require further explanation. Higher values of either of these metrics indicate greater contact of sediment with shallow water. Specifically, shoreline configuration is the ratio of the length of the shoreline to the circumference of a circle having an area equal to that of the lake so a circular lake has a shoreline configuration of one. Development volume is the ratio of the mean depth to the maximum depth. Shallow lakes with a high value for development volume will have greater exposure of shallow bottom sediments.
From the data I also calculated the density of shoreline homes (number of nearshore homes divided by shoreline length; SH_dens) and the watershed area to lake area ratio (WA_LA) as potential explanatory variables. I thought the density of shoreline homes would be a better explanatory variable than just the number of homes around lakes of varying size. Watershed area to lake area ratio is thought to be an important factor regulating lake trophic state - a larger ratio indicates a potentially larger contribution of material to the lake from the watershed relative to its size, although a larger ratio also indicates a potentially larger amount of lake flushing/shorter water residence time.
Methods
It took a few days to enter the data for 362 lakes. After filtering these lakes to those in the Puget Lowland and excluding some lakes because of they weren’t small lakes (e.g., Lake Union and Portage Bay in King County), they were saline, they were mostly covered by emergent vegetation (they were really wetlands), or they were reservoirs on a larger river system. This screening resulted in 226 Puget Lowland lakes.
I used boosted regression tree (BRT) models to identify the relative importance of the selected variables that explain the variation in total phosphorus and Secchi depth (log10 transformed) in this dataset. BRT models are a class of machine learning models that combine the results of sequentially fit random forest regression trees. BRT models have been shown to have superior performance over linear modeling techniques particularly in the case of highly skewed environmental data.
I selected nine potential explanatory variables - maximum lake depth (Depth_max_ft), shoreline configuration (SC), development volume (DV), % shoreline residential development (Res_sl_pct), % watershed agriculture (Ag_pct), % watershed forest (For_pct), watershed % residential development (Res_pct), % of watershed covered by lake area (LSA_pct), and water color (Color_PCU). I plotted a Spearman rank correlation matrix to evaluate colinearity among these variables and did some preliminary BRT model testing to land on what I believed was the most parsimonious approach to model development. Model development generally followed methods outlined by Waite et al. (2019). Model development started with an initial full model followed by model simplification steps that resulted in a final best model for log10(TP) and log10(Secchi).
code to load and process the lake data
library(tidyverse)library(lubridate)library(sf)library(mapview)### gbm + dismo - boosted regression tree packageslibrary(gbm)library(dismo)library(conflicted)conflict_prefer("filter", "dplyr", quiet = T)conflict_prefer("select", "dplyr", quiet = T)library(readxl)source('gbm.plot.kc.R')### get PL ecoregion boundary....ecoregion <-st_read("C:/Users/degaspec/OneDrive - King County/gis_external/EPA_Region10_ecoregions/reg10_eco_l3.shp",quiet =TRUE) %>%st_transform(4269)# st_crs(ecoregion)pugetl <- ecoregion %>%filter(US_L3NAME =="Puget Lowland") %>% dplyr::select(US_L3NAME) %>%st_transform(4269)usgs.p <-read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/bortleson_usgs_stats_1976.xlsx",'data', na ="NA") %>%mutate(lat = LAT_D + LAT_M/60+ LAT_S/60/60, lon =-(LON_D + LON_M/60+ LON_S/60/60), Res_pct = Res_urb_pct+Res_sub_pct)# remove > for Secchi.....consider replacing with NADA method in futureusgs.wq <-read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/bortleson_usgs_stats_1976.xlsx",'water quality') %>%filter(Depth_ft ==3) %>%filter(SampleSite==1) %>%# to remove second Fawn Lake sample site# mutate(TN = TNO2+TNO3+TNH3+TON, Secchi_ft = as.numeric(gsub(">",'',Secchi_ft)))rowwise() %>%mutate(TN =sum(TNO2,TNO3,TNH3,TON,na.rm=TRUE), Secchi_ft =as.numeric(gsub(">",'',Secchi_ft)))# remove < for FC_mean.....consider replacing with NADA method in futureusgs.fc <-read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/bortleson_usgs_stats_1976.xlsx",'fecal coliform') %>%mutate(Date_fc = Date, Time_fc = Time, FC_mean =as.numeric(gsub(">",'',FC_mean))) %>%select(-SampleSite,-Date,-Time)# Loop has no wq datausgs <-left_join(usgs.p,usgs.wq,by=c("County","Lake")) # presume drainage area includes lake surface as implied by the provided lake surface percent definition and values consistent with lake area/drainage areausgs <-left_join(usgs,usgs.fc,by=c("County","Lake")) %>%filter(!Lake %in%c('Loop','Portage Bay','Union')) %>%mutate(Inflow =ifelse(Lake=="Fenwick","Perennial",Inflow), SH_dens = Homes_ns/SL_mi, WA_LA = (DA_sqmi*640)/LA_ac, Ag =factor(ifelse(Ag_pct==0,0,1)),Inflow =factor(Inflow), Outflow_channel =factor(Outflow_channel))# and remove 2 lakes in Bellevue with crazy high TP - Sturtevant and Larsen# usgs <- usgs %>% filter(!Lake %in% c('Sturtevant','Larsen'))# and remove Lake Tapps because it doesn't have any land use data and Mud Mountain, LaGrande, and Cushman because they are reservoirs with most of drainage high elevation...not lowland lakesusgs <- usgs %>%filter(!Lake %in%c('Tapps','Mud Mountain','LaGrande','Cushman','Cushman (Lower)'))# and remove Crockett (Island Co) because it is hypersaline and Tennant (Whatcom) because it looks more like a wetland than a lake and Shannon (Skagit) because it is a hydropower reservoir....'Unnamed (Island) lakes and Kai Tai (Jefferson) also hypersaline lakesusgs <- usgs %>%filter(!Lake %in%c('Shannon','Tennant','Crockett','Unnamed (31N-1E-6)','Unnamed (33N-2E-7)','Kah Tai'))# WHAT to do about Samish? (2 basins) only use largest? usgs <- usgs %>%filter(!Lake %in%c('Samish (West Arm)','Samish (East Arm)'))# remove Capitol Lake...Percival Cove usgs <- usgs %>%filter(!Lake %in%c('Capitol','Percival Cove','Patterson (North Arm)'))### intersect puget lowland shape with sampling pointsusgs <-st_as_sf(usgs, coords =c("lon", "lat"), remove =FALSE)st_crs(usgs) <-4269usgs <-st_intersection(usgs,pugetl) %>%st_drop_geometry() %>%data.frame()###################################################################################
Results
Let’s start by acknowledging that LSA_pct is correlated strongly with WA_LA (R=-0.95) and WA_LA is highly left skewed while LSA_pct is less so (not shown). The relative proportion of the lake surface to watershed area (LSA_pct) is considered to be an important determinant of lake trophic state so I’ll keep LSA_pct.
Now I’ll consider the eleven remaining candidate explanatory variables with R >0.6 (see the correlation matrix plot below). Moving left to right Slope_pct and Depth_max_ft are correlated (R=0.76) and both are left skewed. I’ll keep Depth_max_ft since it has been shown to be important explanatory variable in this type of analysis before and drop slope_pct. Not surprisingly, Res_sl_pct is very strongly correlated with SH_dens (R=0.96), but these two variables are also correlated with Res_pct (R=0.75 and 0.77, respectively). Correlation across different scales is a persistent problem when trying to evaluate effects of scale (e.g., local vs watershed) - the same metrics at local and watershed scale are often highly correlated. I’ll drop SH_dens because it has a more skewed distribution than Res_sl_pct. Res_pct is correlated with For_pct (R=-0.62). Regardless, I’m going to keep For_pct and Res_pct and look at their interactions in the model.
I suspect color might also be a potentially important variable so I will include it. After this initial screening of candidate variables, I have nine potential explanatory variables that I can test in my BRT models. Unfortunately seven lakes did not have associated surface water color measurements so the number of independent lakes available to model was 209.
Spearman rank correlation matrix of the eleven candidate explanatory variables. The distribution of each variable is shown on the diagonal. Above the diagonal the correlation coefficient and significance level as stars - p-values(0, 0.001, 0.01, 0.05, 0.1, 1) <=> symbols(“”, “”, “”, “.”, ” “). Below the diagonal the bivariate scatter plot with a fitted loess line is displayed.
Spearman rank correlation matrix
# Based on correlations keep SH_dens and drop Res_sl_pct and SH_dens. Drop WA_LA and keep LSA_pct. # Keep Res_pct even though it is correlated with For_pct.# Drop slope_pct and keep Depth_max_ft. Drop SC and keep DV.# also replace Ag_pct with a factor for Ag = 0 = 0 or Ag > 0 = 1 (presence absence of agriculture in watershed)# Final 6: Depth_max_ft (14), DV (17), For_pct (27), LSA_pct (28), Res_pct (32), Ag presence/absence (61)
Total Phosphorus model
The next step is to fit the TP model to the nine potential explanatory variables. First, a map to illustrate the variability of 1970s lake surface TP values across the Puget Lowland of Puget Sound.
The cross validation (CV) R2 of this model is 0.50. Not too shabby. This is comparable to CV R2s reported by Read et al. (2015) in a similar study using lake data for the contiguous U.S. Note that the CV R2 is preferred over the conventional R2 because it is more conservative. In this case it was based on holding 25% (bag fraction) of the lakes out for each regression tree split. The withheld sites are then used to test the percentage of deviance explained by the split.
TP boosted regression tree model
tmp <-mutate(usgs,TP =log10(TP*1000), TN =log10(TN*1000), Secchi_ft =log10(Secchi_ft)) %>%data.frame()# remove missing water color valuestmp <- tmp %>%filter(!is.na(Color_PCU))############################################################################################################ total phosphorus model########################################################################################################## initial model to optimize number of treesset.seed(2974)gbm1 <-gbm.step(data = tmp, gbm.x =c(15,17,18,23,27,28,29,37,50), gbm.y=c(46), family="gaussian", tree.complexity =5, learning.rate=0.001,bag.fraction=0.75, verbose =FALSE, silent =TRUE, plot.main =FALSE)# mean(gbm1$residuals^2) # mean residual difference - here 0.08# gbm1$cv.statistics$deviance.mean # here 0.138 estimated cv deviance# gbm1$self.statistics$mean.null # here 0.149 mean total deviance# # # R2paste0('R2 = ', round((gbm1$self.statistics$mean.null -mean(gbm1$residuals^2) )/gbm1$self.statistics$mean.null,2))
# # # note how much lower the cross validation (CV) R2 is...this model would need a lot more work#
The relative variable importance plot indicates that Color_PCU is the most important variable followed by Depth_max_ft, Ag_pct, Res_sl_pct, Res_pct, For_pct, SC, DV, and LSA_pct in descending order. I think we should simply prune this model by removing the variables with relative importance less than 6.
initial TP model variable importance plot
# # contributions of each predictorpar(mar=c(5,8,4,0)+.1)summary(gbm1, cBars =10, las=2, cex.names=0.8, plotit =TRUE)
Log10(TP) model relative variable importance plot.
var rel.inf
Color_PCU Color_PCU 39.028103
Depth_max_ft Depth_max_ft 28.462399
Ag_pct Ag_pct 9.405917
Res_sl_pct Res_sl_pct 5.772853
Res_pct Res_pct 5.558514
For_pct For_pct 4.689638
SC SC 3.258013
DV DV 2.748481
LSA_pct LSA_pct 1.076082
The cross validation (CV) R2 of the simplified model is 0.52 so model simplification resulted in no loss of performance.
Pruned TP boosted regression tree model
tmp <-mutate(usgs,TP =log10(TP*1000), TN =log10(TN*1000), Secchi_ft =log10(Secchi_ft)) %>%data.frame()# remove missing water color valuestmp <- tmp %>%filter(!is.na(Color_PCU))############################################################################################################ total phosphorus model########################################################################################################## initial model to optimize number of treesset.seed(2974)gbm1 <-gbm.step(data = tmp, gbm.x =c(15,27,50), gbm.y=c(46), family="gaussian", tree.complexity =5, learning.rate=0.001,bag.fraction=0.75, verbose =FALSE, silent =TRUE, plot.main =FALSE)# # # R2paste0('R2 = ', round((gbm1$self.statistics$mean.null -mean(gbm1$residuals^2) )/gbm1$self.statistics$mean.null,2))
The relative variable importance plot indicates that Color_PCU is the most important variable followed by Depth_max_ft, and then Ag_pct. The relative order of the important variables did not change.
Pruned TP model variable importance plot
# # contributions of each predictorpar(mar=c(5,8,4,0)+.1)summary(gbm1, cBars =10, las=2, cex.names=0.8, plotit =TRUE)
Log10(TP) model relative variable importance plot.
The partial dependency plot shows the model response for each variable with the other variables held constant (black line) The dashed red line is the loess smooth curve to aid in interpretation. The light gray symbols are the scatter plots of log10(TP) vs each explanatory variable. The response curves indicate the general shape, direction and potential breakpoints for each explanatory variable.
The response curves illustrate that the model response is consistent with expectations that TP would increase with lake color, percent watershed agriculture, percent watershed and shoreline residential cover and decrease with lake depth. The most pronounced responses were for water color and lake depth with water color greater than about 25 PCU associated with high TP and maximum lake depth greater than about 25 ft associated with lower TP. The response curves for the remaining variables were nearly flat consistent with the work of Kammer (2004) - that watershed development plays at most a secondary role in controlling summer surface TP in Puget Lowland lakes.
TP model partial dependency plots
library(scales)# # Plot the functions and fitted values from the model# graphics.off()gbm.plot.kc(gbm1, n.plots=3, write.title = F,plot.layout=c(3, 1),common.scale =TRUE,# smooth = TRUE,show.contrib =TRUE)
Now to fit the Secchi model to the nine potential explanatory variables. First, a map to illustrate the variability of 1970s lake Secchi depth values across the Puget Lowland of Puget Sound.
Note: I included Secchi measurements reported as “>” some value which indicated that the Secchi depth was greater than the site depth. In the future, we might consider using measures of transparency that result in right censored values (e.g., beam transmission, light extinction, others?). Other options? Removing these results excludes shallow lakes that are fairly transparent….
I followed the same approach as for TP and found the best model. The cross validation (CV) R2 of this model is 0.56. This model had only two explanatory variables - water color (Color_PCU) and maximum lake depth (Depth_max_ft). As expected higher water color was associated with shallower Secchi depths and deeper lakes were associated with deeper Secchi depths.
Secchi depth boosted regression tree model
tmp <-mutate(usgs,TP =log10(TP*1000), TN =log10(TN*1000), Secchi_ft =log10(Secchi_ft)) %>%data.frame()# remove missing water color valuestmp <- usgs %>%filter(!is.na(Color_PCU)&!is.na(Secchi_ft))############################################################################################################ Secchi depth model########################################################################################################## initial model to optimize number of treesset.seed(2974)gbm1 <-gbm.step(data = tmp, gbm.x =c(15,17,18,23,27,28,29,37,50), gbm.y=c(51), family="gaussian", tree.complexity =5, learning.rate=0.001,bag.fraction=0.75, verbose =FALSE, silent =TRUE, plot.main =FALSE)# # contributions of each predictor# summary(gbm1, cBars = 10, las=2, cex.names=0.8, plotit = TRUE)set.seed(2974)gbm1 <-gbm.step(data = tmp, gbm.x =c(15,50), gbm.y=c(51), family="gaussian", tree.complexity =5, learning.rate=0.001,bag.fraction=0.75, verbose =FALSE, silent =TRUE, plot.main =FALSE)# # play with color# tmp <- filter(usgs,!is.na(Color_PCU)&!is.na(LkShore_Emersed_max))# tmp <- mutate(tmp,Color_PCU = ifelse(Color_PCU==0,log10(Color_PCU+0.1),log10(Color_PCU)))# set.seed(2974)# gbm1 <- gbm.step(data = tmp, gbm.x = c(15,23,27,29,32,34), gbm.y=c(50), family="gaussian", tree.complexity = 5, learning.rate=0.001,# bag.fraction=0.75, verbose = FALSE, silent = TRUE, plot.main = FALSE)# mean(gbm1$residuals^2) # mean residual difference - here 0.08# gbm1$cv.statistics$deviance.mean # here 0.138 estimated cv deviance# gbm1$self.statistics$mean.null # here 0.149 mean total deviance# # # R2paste0('R2 = ', round((gbm1$self.statistics$mean.null -mean(gbm1$residuals^2) )/gbm1$self.statistics$mean.null,2))
# # # note how much lower the cross validation (CV) R2 is...this model would need a lot more work#par(mar=c(5,8,4,0)+.1)summary(gbm1, cBars =10, las=2, cex.names=0.8, plotit =TRUE)
var rel.inf
Color_PCU Color_PCU 61.83451
Depth_max_ft Depth_max_ft 38.16549
The variables that best explained Summer lake surface TP in order of importance (most to least) were water color, maximum lake depth, and %Agriculture. The cross validation R-square wasn’t too bad either (CV R2 = 0.52).
The variables that best explained Summer lake Secchi depth in order of importance (most to least) were water color and maximum lake depth (CV R2 = 0.56).
This analysis suggests that we should probably be measuring water color by the platinum-cobalt unit method (not by the UV method we had been using) to better explain variation in King County small lake TP and Secchi depth.
Unfortunately, the USGS did not measure chlorophyll a. However, the more recent regional data does include chlorophyll a data…but may have fewer color measurements. Anyway, I think this exercise has been useful and helps inform what we can know about a particular lake based on it’s maximum depth and measured color.