WQI vs. Trout Biomass in PA

This document is intended to serve 2 purposes:

An example of a data exercise using tidy coding workflow, including piping and featuring dplyr functionality
Use real data to investigate the response of trout biomass (by species and overall) to water quality (PADEP WQI)

First I’ll import data from excel documents in my working directory.

## read in Class A biomass df from MB
classAbiomass <- read_excel('classAbiomass.xlsx', sheet = 'Sheet1')
classAbiomass <- janitor::clean_names(classAbiomass) #clean column names

For spreadsheets with problematic column headers, such as spaces, special characters, capitalization, etc., the ‘janitor’ package is useful. Also, the read_excel() function imports data as a tibble, which has numerous advantages over traditional dataframes.

county	water	section	fishery	section_limits	lower_limit_latitude	lower_limit_longitude	length_miles	date_added_to_class_a_list	wild_brook_trout_biomass	wild_brown_trout_biomass	wild_rainbow_trout_biomass
Somerset	South Fork Bens Creek	4	Brown	30 m downstream private bridge off T-590 to T-785 bridge	40.25059	-78.97206	3.51	July 2019	NA	44.03	NA
McKean	UNT to Lewis Run (RM 3.20)	1	Brown	Headwaters to Mouth	41.84286	-78.69338	2.68	April 2019	1.29	40.62	NA
Clearfield/ Elk	Grapevine Run	1	Brook	Headwaters to Mouth	41.20558	-78.66495	0.76	April 2019	40.47	NA	NA

After looking at the dataframe, I identified numerous issues which I dealt with in one code chunk.

Saved the result as a new object (classA_df) to move forward with and preserve the original
Calculated total biomass (sum of all species) using rowSums(), while dealing with missing values (NAs)
Fixed an incorrect spelling of February with str_replace() (stringr package)
Replaced errant longitude with correct value (someone forgot to add the ‘-’)
Leveraged the lubridate package to get the dates into a readable format. This is a powerful package for dealing with annoying date issues
Removed a record with an obviously incorrect longtiude (-8.5)
Dropped the old date column since it’s no longer necessary

classA_df <- 
  classAbiomass %>% 
  
  mutate(
    #calculate total biomass
    total_biomass = rowSums(dplyr::select(
      ., wild_brook_trout_biomass, wild_brown_trout_biomass, wild_rainbow_trout_biomass), na.rm=T),
    
    date_correct = str_replace(date_added_to_class_a_list, "Febuary", "February"), # use str_replace to correct Febuary

    lower_limit_longitude = replace(lower_limit_longitude, lower_limit_longitude > 77.71, -77.71111), # replace errant longitude
    
    date = ymd(parse_date_time(date_correct, orders="bY")), # use parse_date to get into date format
    month = month(date), # extract month
    year = year(date) # extract year
  ) %>% 
  
  #remove another errant longitude that cannot be understood
  filter(!(lower_limit_longitude > -10)) %>% 
  
  #drop unnecesary columns
  select(-date_added_to_class_a_list)

The result is a clean dataframe

county	water	section	fishery	section_limits	lower_limit_latitude	lower_limit_longitude	length_miles	wild_brook_trout_biomass	wild_brown_trout_biomass	wild_rainbow_trout_biomass	total_biomass	date_correct	date	month	year
Somerset	South Fork Bens Creek	4	Brown	30 m downstream private bridge off T-590 to T-785 bridge	40.25059	-78.97206	3.51	NA	44.03	NA	44.03	July 2019	2019-07-01	7	2019
McKean	UNT to Lewis Run (RM 3.20)	1	Brown	Headwaters to Mouth	41.84286	-78.69338	2.68	1.29	40.62	NA	41.91	April 2019	2019-04-01	4	2019
Clearfield/ Elk	Grapevine Run	1	Brook	Headwaters to Mouth	41.20558	-78.66495	0.76	40.47	NA	NA	40.47	April 2019	2019-04-01	4	2019

Since we now have a clean class A dataframe with lat/long, we can create an interactive map using the leaflet() package. This is a very simple process, but allows you to leverage basemaps and display info with pop-up labels that display while hovering/clicking. You can also add data filters via radio buttons that help to display categorical information spatially. Another advantage is these maps are easily embedded in Rmarkdown documents that can be published to the web, thus no need for IT involvement.

# leaflet map with radio buttons to toggle between species

#define legend categories
bins <- c(0, 20, 40, 60, 100, 200, Inf)
pal_bin <- colorBin("viridis", domain = classA_df$total_biomass, bins = bins, na.color = 'transparent')
# setting na.color = 'transparent' causes sites where no biomass data is present to become invisible

biomassBySpecies_map <-
  classA_df %>%
  
  leaflet() %>% 
  addProviderTiles(providers$Esri.NatGeoWorldMap) %>% 
  
  #species
  
  #brook
  addCircleMarkers(lng = ~lower_limit_longitude, lat = ~lower_limit_latitude,
                   col = ~pal_bin(wild_brook_trout_biomass),
                   opacity = 0.8, popup = ~ water, 
                   label = ~ as.character(round(wild_brook_trout_biomass, 2)), group = 'Brook') %>%
  
  #brown
  addCircleMarkers(lng = ~lower_limit_longitude, lat = ~lower_limit_latitude,
                   col = ~pal_bin(wild_brown_trout_biomass),
                   opacity = 0.8, popup = ~ water, 
                   label = ~ as.character(round(wild_brown_trout_biomass, 2)), group = 'Brown') %>%
  
  #rainbow
  addCircleMarkers(lng = ~lower_limit_longitude, lat = ~lower_limit_latitude,
                   col = ~pal_bin(wild_rainbow_trout_biomass),
                   opacity = 0.8, popup = ~ water, 
                   label = ~ as.character(round(wild_rainbow_trout_biomass, 2)), group = 'Rainbow') %>%
  
  #total
  addCircleMarkers(lng = ~lower_limit_longitude, lat = ~lower_limit_latitude,
                   col = ~pal_bin(total_biomass),
                   opacity = 0.8, popup = ~ water, 
                   label = ~ as.character(round(total_biomass, 2)), group = 'Total') %>%
  
  #legend -- for bin palette
  addLegend("topright", pal = pal_bin, values = ~total_biomass,
            title = "Trout Biomass",
            opacity = 0.8) %>% 
  
  
  # Layers control
  addLayersControl(
    overlayGroups = c("Brook", 'Brown', 'Rainbow', "Total"),
    options = layersControlOptions(collapsed = FALSE)
  )
  
biomassBySpecies_map

Now, since our intention is to examine the relationship of trout biomass and water quality, lets bring in the WQI dataset from TW (2008-2018). We again use read_excel() and janitor to clean column names.

wqi <- read_excel('WQI_2018.xlsx', sheet = 'Sheet1')
wqi <- janitor::clean_names(wqi) #clean column names

I want to show WQI data spatially with a leaflet map too, but first we have to deal with the classifications. These categorical ‘grades’ are problematic, so it’s best to make an additional column as a factor and set levels.

## make grade a factor variable for use with symbology below
wqi <- 
  wqi %>% 
  mutate(f_grade = factor(grade, levels = c("Poor", "Fair", "Average", "Good"))) 

levels(wqi$f_grade)

## [1] "Poor"    "Fair"    "Average" "Good"

Now we can code our leaflet map (just showing wqi, no swqi’s)

## define color palette
factpal <- colorFactor(c('#d7191c', '#fdae61', '#abdda4', '#2b83ba'), wqi$f_grade)

# overall WQI only
wqi_map <-
  wqi %>%
  
  leaflet() %>% 
  addProviderTiles(providers$Esri.NatGeoWorldMap) %>% 
  addCircleMarkers(
    col = ~factpal(f_grade),
    opacity = 0.8,
    popup = ~ as.character(round(wqi, 2)), label = ~ flowline_gnis) %>%
  addLegend(
    colors=c('#d7191c', '#fdae61', '#abdda4', '#2b83ba'),
    labels=c('Poor', 'Fair', 'Average', 'Good'),
    opacity = 0.8,
    title = 'DEP 2018 WQI Scores')

wqi_map

Now that we’ve visualized trout biomass and wqi throughout space, the next step is to export our clean dataframes to excel so we can merge with HUC12 watershed in GIS. This is an annoying step that requires multiple steps outside of the Rstudio project.

For example, it’s a pain to export to excel, import to GIS, spatial join, save as shapefile, open .dbf in excel, save, and re-import to R. Plus, I’m learning that DEP’s GIS only accepts .xls – while R exports as .xlsx. Need to find a work-around to eliminate yet another intermediate step.

For the long-term we need to find way to work with polygons/perfom GIS joins/relates in R. Maybe ArcPro would provide some less painful alternatives?

For the GIS task itself, you can actually build a spreadsheet [write_xlsx()] using dataframes in R. See example below using biomass and wqi dataframes. This helps prevent a bunch of individual csv files in your directory – too many and naming conventions become an issue.

## I don't want to take the entire dataframes out to merge with HUC12, GIS does strange things to numbers
## see if first 7 columns have enough unique info to complete merge

dim(classA_df) # 962  16

classA_df_merge <- 
  classA_df %>% 
  select(1:7) %>% # select only first 7 cols (includes lat/long)
  distinct() # determine how many unique combos of 1st 6 cols are present (hopefully 962!)

classA_df_merge
dim(classA_df_merge) # 962   7 -- check!


## same for wqi df

dim(wqi) #2625   17

wqi_merge <- 
  wqi %>% 
  select(1:6, 15:16) %>% # select only first 6 cols plus lat/long for projecting
  distinct() # determine how many unique combos of 1st 6 cols are present (hopefully 2625!)

wqi_merge
dim(wqi_merge) #2625    8 -- check!

# build spreadsheet/export dataframes as single excel sheet with 2 tabs for GIS merge with HUC12


write_xlsx(list(classA_df_merge = classA_df_merge, wqi_merge = wqi_merge), "gis_huc12.xlsx")

THe GIS merge happens in a black box outside of R.

I simply projected both dataframes as shapefiles, and merged with HUC12 polygon layer

Now we re-import the result as we did above

However, GIS chops off column names if they’re too long so we need to deal with that to merge with the original df to access the numeric fields of interest. Using piping we can read-in data, subset columns, rename columns, and join with original df all in one step.

classA_huc12_join <- read_excel('classA_huc12_join.xlsx', sheet = 'classA_huc12_join') %>% 
  
  select(2:6, 16:21) %>% # select only columns of interest
  
  rename(section_limits = section_li) %>% #rename columns that gis cut off
  
  left_join(classA_df, by=c("county", "water", "section", "fishery", "section_limits"))

We confirm that dfs merged successfully due to helpful messages from the tidylog package. When loaded after tidyverse this package gives narrations on what each function accomplished.

county	water	section	fishery	section_limits	HUC_8	HUC_8_NAME	HUC_10	HUC_10_NAM	HUC_12	HUC_12_NAM	lower_limit_latitude	lower_limit_longitude	length_miles	wild_brook_trout_biomass	wild_brown_trout_biomass	wild_rainbow_trout_biomass	total_biomass	date_correct	date	month	year
Lancaster	Segloch Run	2	Brook	Intersection of T 596 and T 548 to SR026	02050306	Lower Susquehanna	0205030609	Cocalico Creek	020503060902	Middle Creek	40.23230	-76.27900	2.60	27.35	29.95	NA	57.30	January 1983	1983-01-01	1	1983
Bedford-Blair	Boiling Spring Run	2	Brown	Dam at pond immediately downstream from Harvest Lane to Mouth	02050302	Upper Juniata	0205030201	Upper Frankstown Branch Juniata River	020503020101	Beaverdam Creek	40.27194	-78.46639	2.11	NA	56.21	NA	56.21	January 2017	2017-01-01	1	2017
Perry	Kansas Valley Run	1	Mixed Brook/Brown	Headwaters to Mouth	02050304	Lower Juniata	0205030409	Tuscarora Creek	020503040903	Horse Valley Run	40.34373	-77.59420	2.48	23.21	31.96	NA	55.17	March 2001	2001-03-01	3	2001

Same for the WQI - HUC12 df

wqi_huc12_join <- read_excel('wqi_huc12_join.xlsx', sheet = 'wqi_huc12_join') %>% 
  
  select(2:7, 17:22) %>% # select only columns of interest
  
  rename(nhd_location_id = nhd_locati, monitoring_alias = monitoring, 
         flowline_gnis = flowline_g, year_collected = year_colle) %>% #rename columns that gis cut off
  
  left_join(wqi, by=c("nhd_location_id", "monitori_1", "monitoring_alias", "sample_loc", "flowline_gnis", "year_collected"))

nhd_location_id	monitori_1	monitoring_alias	sample_loc	flowline_gnis	year_collected	HUC_8	HUC_8_NAME	HUC_10	HUC_10_NAM	HUC_12	HUC_12_NAM	wqi	wq_similar	forest	urban	ag	land_dist	grade	n	latitude	longitude	f_grade
1100734	NA	NA	Cedar Run at culvert outlet downstream of Waste Management	NA	18	02050305	Lower Susquehanna-Swatara	0205030505	Yellow Breeches Creek	020503050505	Lower Yellow Breeches Creek	30.33704	Ag	33.33339	37.79162	30.33704	39.21016	Poor	1	40.23016	-76.95059	Poor
36455	NA	WQN0212	OLD US RT 11; BRIDGE ST BR @ NEW CUMBERLAND	Yellow Breeches Creek	8	02050305	Lower Susquehanna-Swatara	0205030505	Yellow Breeches Creek	020503050505	Lower Yellow Breeches Creek	49.34103	Forest	49.34103	50.26803	53.29568	76.61736	Fair	16	40.22430	-76.86069	Fair
36455	NA	WQN0212	OLD US RT 11; BRIDGE ST BR @ NEW CUMBERLAND	Yellow Breeches Creek	9	02050305	Lower Susquehanna-Swatara	0205030505	Yellow Breeches Creek	020503050505	Lower Yellow Breeches Creek	56.38481	Ag	57.43163	60.63640	56.38481	76.58827	Fair	19	40.22430	-76.86069	Fair

Now after all that wrangling, the analysis step becomes possible.

The analysis routine consists of averaging wqi in each huc12 watershed since we could potentially have multiple wqi points per huc 12. The resulting characterization of water quality in each watershed can serve as a predictor variable for trout biomass.

wqi_huc12_stat <- 
  wqi_huc12_join %>% 
  
  group_by(HUC_12, HUC_12_NAM) %>% # summarize by these fields
  
  summarize(mean_wqi = mean(wqi, na.rm=T), sd_wqi = sd(wqi, na.rm=T), n_wqi = length(wqi), # calculate stats
            minYear = min(year_collected, na.rm=T), maxYear = max(year_collected, na.rm=T)) 
# could keep going to calculate mean swqi scores

444 huc12s have wqi points within. Range of n is 1-37.

HUC_12	HUC_12_NAM	mean_wqi	sd_wqi	n_wqi	mean_forest_wqi	sd_forest_wqi	mean_urban_wqi	sd_urban_wqi	mean_ag_wqi	sd_ag_wqi	mean_land_dist_wqi	sd_land_dist_wqi	minYear	maxYear
020401010402	Lower Equinunk Creek	93.47634	NA	1	97.07617	NA	98.40826	NA	93.51108	NA	93.47634	NA	18	18
020401010605	Masthope Creek	82.67053	NA	1	96.03777	NA	96.08027	NA	95.36778	NA	82.67053	NA	16	16
020401030201	East Branch Dyberry Creek	91.91476	1.170703	2	96.94153	2.389315	98.16760	1.569052	93.56753	3.508076	94.57599	2.592843	10	18

So water quality has been characterized (mean, sd wqi) for 444 huc12s

Now join with classA df using the HUC12 field to investigate how responsive trout biomass is to WQ

classA_wqi_df <- 
  classA_huc12_join %>% 
  
  inner_join(wqi_huc12_stat, by=c('HUC_12', 'HUC_12_NAM')) ## using inner_join so that only matching x & y records are returned

The merge resulted in 137 unique huc12s with water quality AND trout biomass data - decent n

Let’s plot the result, by species to visualize relationship between wqi (x) and trout biomass (y)

To accomplish, we need to melt and facet_wrap in ggplot2 to show x - y relationships. Piping like this - including melting and plotting results in no objects created in your local environment!

classA_wqi_df %>% 
  
  reshape2::melt(id.vars = c('HUC_12', 'HUC_12_NAM', 'mean_wqi', 'sd_wqi'),
                 measure.vars = c("wild_brook_trout_biomass", "wild_brown_trout_biomass", 
                                  "wild_rainbow_trout_biomass","total_biomass")) %>% 
  
  drop_na(value) %>% #drop the rows with NA value for trout biomass
  
  ggplot(aes(x=mean_wqi, y=value, color = variable, fill = variable)) +
  facet_wrap(~variable, ncol = 2, scales = 'free_y') +
  geom_point(pch=21, alpha = 0.7) +
  geom_smooth(method = lm) + 
  labs(x = 'Mean HUC12 WQI', y = 'Trout Biomass (kg/ha)') +
  theme_minimal() +
  theme(legend.position = 'none')

If we want to save high quality image file of plot, use ggsave() to manipulate file type, dpi, and dimensions

ggsave(file='WQI_vs_TroutBiomass.png', scale=1, units='in', dpi=300, width=10, height=6)

Here is a better representation of each species’ relationship to the overall WQI, and individual SWQI’s.

Subsequent regression modeling of brook and brown trout vs. water quality is revealing. Rainbow trout response was not modeled due to low sample size (n=14).

Species	WQI Type	Intercept	Slope	P	R2_adjusted	n	Result
Brook Trout	Agriculture SWQI	14.81 +/- 7.9	0.29 +/- 0.09	0.002	2.88%	307	Positive
Brook Trout	Forest SWQI	14.78 +/- 9.09	0.29 +/- 0.1	0.006	2.11%	307	Positive
Brook Trout	Land Disturbance SWQI	53.1 +/- 8.12	-0.16 +/- 0.1	0.092	0.60%	307	No Trend
Brook Trout	Urban SWQI	16.35 +/- 10.07	0.27 +/- 0.11	0.021	1.42%	307	Positive
Brook Trout	WQI	33.17 +/- 6.45	0.08 +/- 0.08	0.312	0.01%	307	No Trend
Brown Trout	Agriculture SWQI	170.61 +/- 13.01	-1.5 +/- 0.17	0.000	26.38%	221	Negative
Brown Trout	Forest SWQI	191.33 +/- 15.1	-1.7 +/- 0.19	0.000	26.78%	221	Negative
Brown Trout	Land Disturbance SWQI	191.74 +/- 51.42	-1.54 +/- 0.6	0.010	2.53%	221	Negative
Brown Trout	Urban SWQI	213.08 +/- 17.54	-1.92 +/- 0.21	0.000	26.58%	221	Negative
Brown Trout	WQI	174.98 +/- 13.87	-1.65 +/- 0.19	0.000	25.20%	221	Negative

Brook trout biomass is significantly, positively correlated with agriculture, forest, and urban SWQIs. For example, for every 3 point increase in the agriculture SWQI score (improving water quality), brook trout biomass increased by ~ 1 kg/ha. This suggests that brook trout are responding positively to increases in water quality spatially across PA. Although these relationships are positive and significant, the models describe very little variation in the data (0 - 2.88%). The most likely interpretation is that there are many other stressors involved in controlling brook trout abundance such as habitat degradation and population fragmentation. Water quality has a role in regulating brook trout abundance, but it seems to be a minor factor.

Conversely, brown trout are negatively correlated with all SWQI scores and the overall WQI score. The slopes (effect sizes) are much greater for browns compared to brooks. For example, for every 1 point increase in overall WQI score, brown trout biomass decreases by 1.65 kg/ha. Additionally, the amount of variation described by brown trout models is much increased (25 - 27% for overall WQI and all SWQIs except land disturbance). Taken together, the magnitude and directionality of the slope estimates suggest that water quality is not playing a large role in regulating brown trout populations.

A shortcoming of this analysis may be that mean wqi scores in huc12 watersheds are not representative of on-site water quality. However, huc12s are small enough that it shouldn’t be a major issue.

I hope this exercise included some useful code while informing a current topic of discussion.