USGS 1970s lake data

Author

Curtis DeGasperi

Published

April 3, 2025

Introduction

Over 20 years ago, Nora Kammer (2004) reported on analyses of King County Small Lakes data to address three hypotheses:

As the proportion of high density land uses in lake watersheds increase, lakes will become more eutrophic.
As the proportion of forested land uses in lake watersheds increase, lakes will become less eutrophic.
Septic systems within lake watersheds will cause lakes to become more eutrophic.

Although not explicitly stated, Nora’s approach is often described as a space-for-time approach. Nora looked for correlations between independent variables (e.g., watershed %forest) and dependent variables (trophic state indicators chlorophyll a, total phosphorus, and Secchi depth) at specific points in time.

The space-for-time approach uses spatial gradients of independent variables as a proxy for temporal changes in dependent variables. This approach is typically used because data were not collected at one or more individual locations over a period of time covering significant changes in the independent variables of interest.

Nora’s results were quite striking. High density development in a lake’s watershed was negatively correlated with lake trophic state indicators (higher density development ~ lower trophic state). and forest cover was positively correlated with trophic state indicators (higher forest cover ~ higher trophic state). A less counter intuitive finding was that lakes with sewered watersheds (i.e., few or no septic systems) tended to be less eutrophic.

I’ve been reporting elsewhere about a revisit of Nora’s analyses using more recent King County, Snohomish County, Institute of Watershed Studies (primarily Whatcom County), and National Lakes Assessment data. Here I describe a revisit of Nora’s first two questions using Puget Sound basin lake and watershed data collected by the USGS in the 1970s (Bortleson et al. 1976).

The USGS data ranged from watershed land use, lake physical characteristics, to water quality data (single summer surface ~ 1 m samples from the early 1970s). The water quality data included TP, TN (derived by summing N forms), Secchi depth, and Color (PCU), but sadly no chlorophyll a measurements. The physical lake characteristics included lake area, volume, mean and maximum depth, shoreline length, shoreline configuration, development volume, and bottom slope. The watershed data included % shoreline residential development, number of nearshore homes, % agriculture and forest land use, and the % of watershed suburban and urban residential land use. I summed the % of suburban and urban residential land use into a % residential land use since most (%81) of the % residential urban values were zero - 21% of the combined residential values were zero.

Physical lake characteristics shoreline configuration (SC) and development volume (DV) probably require further explanation. Higher values of either of these metrics indicate greater contact of sediment with shallow water. Specifically, shoreline configuration is the ratio of the length of the shoreline to the circumference of a circle having an area equal to that of the lake so a circular lake has a shoreline configuration of one. Development volume is the ratio of the mean depth to the maximum depth. Shallow lakes with a high value for development volume will have greater exposure of shallow bottom sediments.

From the data I also calculated the density of shoreline homes (number of nearshore homes divided by shoreline length; SH_dens) and the watershed area to lake area ratio (WA_LA) as potential explanatory variables. I thought the density of shoreline homes would be a better explanatory variable than just the number of homes around lakes of varying size. Watershed area to lake area ratio is thought to be an important factor regulating lake trophic state - a larger ratio indicates a potentially larger contribution of material to the lake from the watershed relative to its size, although a larger ratio also indicates a potentially larger amount of lake flushing/shorter water residence time.

Methods

It took a few days to enter the data for 362 lakes. After filtering these lakes to those in the Puget Lowland and excluding some lakes because of they weren’t small lakes (e.g., Lake Union and Portage Bay in King County), they were saline, they were mostly covered by emergent vegetation (they were really wetlands), or they were reservoirs on a larger river system. This screening resulted in 226 Puget Lowland lakes.

I used boosted regression tree (BRT) models to identify the relative importance of the selected variables that explain the variation in total phosphorus and Secchi depth (log10 transformed) in this dataset. BRT models are a class of machine learning models that combine the results of sequentially fit random forest regression trees. BRT models have been shown to have superior performance over linear modeling techniques particularly in the case of highly skewed environmental data.

I selected nine potential explanatory variables - maximum lake depth (Depth_max_ft), shoreline configuration (SC), development volume (DV), % shoreline residential development (Res_sl_pct), % watershed agriculture (Ag_pct), % watershed forest (For_pct), watershed % residential development (Res_pct), % of watershed covered by lake area (LSA_pct), and water color (Color_PCU). I plotted a Spearman rank correlation matrix to evaluate colinearity among these variables and did some preliminary BRT model testing to land on what I believed was the most parsimonious approach to model development. Model development generally followed methods outlined by Waite et al. (2019). Model development started with an initial full model followed by model simplification steps that resulted in a final best model for log10(TP) and log10(Secchi).

code to load and process the lake data

library(tidyverse)
library(lubridate)
library(sf)
library(mapview)
### gbm + dismo - boosted regression tree packages
library(gbm)
library(dismo)

library(conflicted)
conflict_prefer("filter", "dplyr", quiet = T)
conflict_prefer("select", "dplyr", quiet = T)

library(readxl)

source('gbm.plot.kc.R')

### get PL ecoregion boundary....
ecoregion <- st_read("C:/Users/degaspec/OneDrive - King County/gis_external/EPA_Region10_ecoregions/reg10_eco_l3.shp",quiet = TRUE) %>%
  st_transform(4269)
# st_crs(ecoregion)
pugetl <- ecoregion %>% filter(US_L3NAME == "Puget Lowland") %>% dplyr::select(US_L3NAME) %>% 
  st_transform(4269)

usgs.p <- read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/bortleson_usgs_stats_1976.xlsx",'data', na = "NA") %>% 
  mutate(lat = LAT_D + LAT_M/60 + LAT_S/60/60, lon = -(LON_D + LON_M/60 + LON_S/60/60), Res_pct = Res_urb_pct+Res_sub_pct)

# remove > for Secchi.....consider replacing with NADA method in future
usgs.wq <- read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/bortleson_usgs_stats_1976.xlsx",'water quality') %>% filter(Depth_ft == 3) %>%
  filter(SampleSite==1) %>% # to remove second Fawn Lake sample site
  # mutate(TN = TNO2+TNO3+TNH3+TON, Secchi_ft = as.numeric(gsub(">",'',Secchi_ft)))
  rowwise() %>% 
  mutate(TN = sum(TNO2,TNO3,TNH3,TON,na.rm=TRUE), Secchi_ft = as.numeric(gsub(">",'',Secchi_ft)))
# remove < for FC_mean.....consider replacing with NADA method in future
usgs.fc <- read_excel("C:/Users/degaspec/OneDrive - King County/R/NLA/bortleson_usgs_stats_1976.xlsx",'fecal coliform') %>% 
  mutate(Date_fc = Date, Time_fc = Time, FC_mean = as.numeric(gsub(">",'',FC_mean))) %>% select(-SampleSite,-Date,-Time)
  

# Loop has no wq data
usgs <- left_join(usgs.p,usgs.wq,by=c("County","Lake")) 
# presume drainage area includes lake surface as implied by the provided lake surface percent definition and values consistent with lake area/drainage area
usgs <- left_join(usgs,usgs.fc,by=c("County","Lake")) %>% filter(!Lake %in% c('Loop','Portage Bay','Union')) %>% 
  mutate(Inflow = ifelse(Lake=="Fenwick","Perennial",Inflow), SH_dens = Homes_ns/SL_mi, WA_LA = (DA_sqmi*640)/LA_ac, Ag = factor(ifelse(Ag_pct==0,0,1)),
         Inflow = factor(Inflow), Outflow_channel = factor(Outflow_channel))

# and remove 2 lakes in Bellevue with crazy high TP - Sturtevant and Larsen
# usgs <- usgs %>% filter(!Lake %in% c('Sturtevant','Larsen'))

# and remove Lake Tapps because it doesn't have any land use data and Mud Mountain, LaGrande, and Cushman because they are reservoirs with most of drainage high elevation...not lowland lakes
usgs <- usgs %>% filter(!Lake %in% c('Tapps','Mud Mountain','LaGrande','Cushman','Cushman (Lower)'))

# and remove Crockett (Island Co) because it is hypersaline and Tennant (Whatcom) because it looks more like a wetland than a lake and Shannon (Skagit) because it is a hydropower reservoir....'Unnamed (Island) lakes and Kai Tai (Jefferson) also hypersaline lakes
usgs <- usgs %>% filter(!Lake %in% c('Shannon','Tennant','Crockett','Unnamed (31N-1E-6)','Unnamed (33N-2E-7)','Kah Tai'))
                        
# WHAT to do about Samish? (2 basins) only use largest?                        
usgs <- usgs %>% filter(!Lake %in% c('Samish (West Arm)','Samish (East Arm)'))

# remove Capitol Lake...Percival Cove 
usgs <- usgs %>% filter(!Lake %in% c('Capitol','Percival Cove','Patterson (North Arm)'))

### intersect puget lowland shape with sampling points
usgs <- st_as_sf(usgs, coords = c("lon", "lat"), remove = FALSE)
st_crs(usgs) <- 4269
usgs <- st_intersection(usgs,pugetl) %>% st_drop_geometry() %>% data.frame()

###################################################################################

Results

Let’s start by acknowledging that LSA_pct is correlated strongly with WA_LA (R=-0.95) and WA_LA is highly left skewed while LSA_pct is less so (not shown). The relative proportion of the lake surface to watershed area (LSA_pct) is considered to be an important determinant of lake trophic state so I’ll keep LSA_pct.

Now I’ll consider the eleven remaining candidate explanatory variables with R >0.6 (see the correlation matrix plot below). Moving left to right Slope_pct and Depth_max_ft are correlated (R=0.76) and both are left skewed. I’ll keep Depth_max_ft since it has been shown to be important explanatory variable in this type of analysis before and drop slope_pct. Not surprisingly, Res_sl_pct is very strongly correlated with SH_dens (R=0.96), but these two variables are also correlated with Res_pct (R=0.75 and 0.77, respectively). Correlation across different scales is a persistent problem when trying to evaluate effects of scale (e.g., local vs watershed) - the same metrics at local and watershed scale are often highly correlated. I’ll drop SH_dens because it has a more skewed distribution than Res_sl_pct. Res_pct is correlated with For_pct (R=-0.62). Regardless, I’m going to keep For_pct and Res_pct and look at their interactions in the model.

I suspect color might also be a potentially important variable so I will include it. After this initial screening of candidate variables, I have nine potential explanatory variables that I can test in my BRT models. Unfortunately seven lakes did not have associated surface water color measurements so the number of independent lakes available to model was 209.

Spearman rank correlation matrix

###################################################################################
### BRT modeling
# https://blog.devgenius.io/r-tutorial-boosted-regression-trees-9f243d88a921
# https://rspatial.org/raster/sdm/9_sdm_brt.html

library("PerformanceAnalytics")
chart.Correlation(usgs[c(15,17,18,19,23,27:29,37,50,60)], histogram=TRUE, pch=19, method = 'spearman')

Spearman rank correlation matrix

# Based on correlations keep SH_dens and drop Res_sl_pct and SH_dens. Drop WA_LA and keep LSA_pct. 
# Keep Res_pct even though it is correlated with For_pct.
# Drop slope_pct and keep Depth_max_ft. Drop SC and keep DV.
# also replace Ag_pct with a factor for Ag = 0 = 0 or Ag > 0 = 1 (presence absence of agriculture in watershed)
# Final 6: Depth_max_ft (14), DV (17), For_pct (27), LSA_pct (28), Res_pct (32), Ag presence/absence (61)

Total Phosphorus model

The next step is to fit the TP model to the nine potential explanatory variables. First, a map to illustrate the variability of 1970s lake surface TP values across the Puget Lowland of Puget Sound.

Lake summer TP map

tmp <- usgs  %>% filter(!is.na(Color_PCU))

mapview(tmp %>% mutate(TP = TP*1000), zcol = 'TP', xcol = "lon", ycol = "lat", crs = 4269, grid = FALSE, layer.name = 'TP Conc (µg/L)')  +
  mapview(pugetl, alpha.regions = 0.2)

A map to illustrate the variability of 1970s lake surface color might also be of interest.

Lake summer surface color map

tmp <- usgs  %>% filter(!is.na(Color_PCU))

mapview(tmp, zcol = 'Color_PCU', xcol = "lon", ycol = "lat", crs = 4269, grid = FALSE, layer.name = 'Color (PCU)')  +
  mapview(pugetl, alpha.regions = 0.2)

The cross validation (CV) R2 of this model is 0.50. Not too shabby. This is comparable to CV R2s reported by Read et al. (2015) in a similar study using lake data for the contiguous U.S. Note that the CV R2 is preferred over the conventional R2 because it is more conservative. In this case it was based on holding 25% (bag fraction) of the lakes out for each regression tree split. The withheld sites are then used to test the percentage of deviance explained by the split.

TP boosted regression tree model

tmp <- mutate(usgs,TP = log10(TP*1000), TN = log10(TN*1000), Secchi_ft = log10(Secchi_ft)) %>% data.frame()

# remove missing water color values
tmp <- tmp  %>% filter(!is.na(Color_PCU))

#########################################################################################################
### total phosphorus model
###################################################################################
######################

# initial model to optimize number of trees
set.seed(2974)
gbm1 <- gbm.step(data = tmp, gbm.x = c(15,17,18,23,27,28,29,37,50), gbm.y=c(46), family="gaussian", tree.complexity = 5, learning.rate=0.001,
                 bag.fraction=0.75, verbose = FALSE, silent = TRUE, plot.main = FALSE)

# mean(gbm1$residuals^2) # mean residual difference - here 0.08
# gbm1$cv.statistics$deviance.mean # here 0.138 estimated cv deviance
# gbm1$self.statistics$mean.null # here 0.149 mean total deviance

# # # R2
 paste0('R2 = ', round((gbm1$self.statistics$mean.null - mean(gbm1$residuals^2) )/gbm1$self.statistics$mean.null,2))

[1] "R2 = 0.72"

TP boosted regression tree model

# # 
# # # CV R2
 paste0('Cross validation (CV) R2 = ', round((gbm1$self.statistics$mean.null - gbm1$cv.statistics$deviance.mean)/gbm1$self.statistics$mean.null,2))

[1] "Cross validation (CV) R2 = 0.5"

TP boosted regression tree model

# 
# # note how much lower the cross validation (CV) R2 is...this model would need a lot more work
#

The relative variable importance plot indicates that Color_PCU is the most important variable followed by Depth_max_ft, Ag_pct, Res_sl_pct, Res_pct, For_pct, SC, DV, and LSA_pct in descending order. I think we should simply prune this model by removing the variables with relative importance less than 6.

initial TP model variable importance plot

# # contributions of each predictor
 par(mar=c(5,8,4,0)+.1)
 summary(gbm1, cBars = 10, las=2, cex.names=0.8, plotit = TRUE)

Log10(TP) model relative variable importance plot.

                      var   rel.inf
Color_PCU       Color_PCU 39.028103
Depth_max_ft Depth_max_ft 28.462399
Ag_pct             Ag_pct  9.405917
Res_sl_pct     Res_sl_pct  5.772853
Res_pct           Res_pct  5.558514
For_pct           For_pct  4.689638
SC                     SC  3.258013
DV                     DV  2.748481
LSA_pct           LSA_pct  1.076082

The cross validation (CV) R2 of the simplified model is 0.52 so model simplification resulted in no loss of performance.

Pruned TP boosted regression tree model

tmp <- mutate(usgs,TP = log10(TP*1000), TN = log10(TN*1000), Secchi_ft = log10(Secchi_ft)) %>% data.frame()

# remove missing water color values
tmp <- tmp  %>% filter(!is.na(Color_PCU))

#########################################################################################################
### total phosphorus model
###################################################################################
######################

# initial model to optimize number of trees
set.seed(2974)
gbm1 <- gbm.step(data = tmp, gbm.x = c(15,27,50), gbm.y=c(46), family="gaussian", tree.complexity = 5, learning.rate=0.001,
                 bag.fraction=0.75, verbose = FALSE, silent = TRUE, plot.main = FALSE)

# # # R2
 paste0('R2 = ', round((gbm1$self.statistics$mean.null - mean(gbm1$residuals^2) )/gbm1$self.statistics$mean.null,2))

[1] "R2 = 0.71"

Pruned TP boosted regression tree model

# # 
# # # CV R2
 paste0('Cross validation (CV) R2 = ', round((gbm1$self.statistics$mean.null - gbm1$cv.statistics$deviance.mean)/gbm1$self.statistics$mean.null,2))

[1] "Cross validation (CV) R2 = 0.52"

Pruned TP boosted regression tree model

The relative variable importance plot indicates that Color_PCU is the most important variable followed by Depth_max_ft, and then Ag_pct. The relative order of the important variables did not change.

Pruned TP model variable importance plot

# # contributions of each predictor
 par(mar=c(5,8,4,0)+.1)
 summary(gbm1, cBars = 10, las=2, cex.names=0.8, plotit = TRUE)

                      var  rel.inf
Color_PCU       Color_PCU 45.59221
Depth_max_ft Depth_max_ft 39.39114
Ag_pct             Ag_pct 15.01665

The partial dependency plot shows the model response for each variable with the other variables held constant (black line) The dashed red line is the loess smooth curve to aid in interpretation. The light gray symbols are the scatter plots of log10(TP) vs each explanatory variable. The response curves indicate the general shape, direction and potential breakpoints for each explanatory variable.

The response curves illustrate that the model response is consistent with expectations that TP would increase with lake color, percent watershed agriculture, percent watershed and shoreline residential cover and decrease with lake depth. The most pronounced responses were for water color and lake depth with water color greater than about 25 PCU associated with high TP and maximum lake depth greater than about 25 ft associated with lower TP. The response curves for the remaining variables were nearly flat consistent with the work of Kammer (2004) - that watershed development plays at most a secondary role in controlling summer surface TP in Puget Lowland lakes.

TP model partial dependency plots

library(scales)

# # Plot the functions and fitted values from the model
# graphics.off()
gbm.plot.kc(gbm1, n.plots=3, write.title = F,
        plot.layout=c(3, 1),
        common.scale = TRUE,
        # smooth = TRUE,
        show.contrib = TRUE)