This document is intended to serve 2 purposes:

  1. An example of a data exercise using tidy coding workflow, including piping and featuring dplyr functionality
  2. Use real data to investigate the response of trout biomass (by species and overall) to water quality (PADEP WQI)


First I’ll import data from excel documents in my working directory.

## read in Class A biomass df from MB
classAbiomass <- read_excel('classAbiomass.xlsx', sheet = 'Sheet1')
classAbiomass <- janitor::clean_names(classAbiomass) #clean column names


For spreadsheets with problematic column headers, such as spaces, special characters, capitalization, etc., the ‘janitor’ package is useful. Also, the read_excel() function imports data as a tibble, which has numerous advantages over traditional dataframes.

county water section fishery section_limits lower_limit_latitude lower_limit_longitude length_miles date_added_to_class_a_list wild_brook_trout_biomass wild_brown_trout_biomass wild_rainbow_trout_biomass
Somerset South Fork Bens Creek 4 Brown 30 m downstream private bridge off T-590 to T-785 bridge 40.25059 -78.97206 3.51 July 2019 NA 44.03 NA
McKean UNT to Lewis Run (RM 3.20) 1 Brown Headwaters to Mouth 41.84286 -78.69338 2.68 April 2019 1.29 40.62 NA
Clearfield/ Elk Grapevine Run 1 Brook Headwaters to Mouth 41.20558 -78.66495 0.76 April 2019 40.47 NA NA


After looking at the dataframe, I identified numerous issues which I dealt with in one code chunk.

  1. Saved the result as a new object (classA_df) to move forward with and preserve the original
  2. Calculated total biomass (sum of all species) using rowSums(), while dealing with missing values (NAs)
  3. Fixed an incorrect spelling of February with str_replace() (stringr package)
  4. Replaced errant longitude with correct value (someone forgot to add the ‘-’)
  5. Leveraged the lubridate package to get the dates into a readable format. This is a powerful package for dealing with annoying date issues
  6. Removed a record with an obviously incorrect longtiude (-8.5)
  7. Dropped the old date column since it’s no longer necessary
classA_df <- 
  classAbiomass %>% 
  
  mutate(
    #calculate total biomass
    total_biomass = rowSums(dplyr::select(
      ., wild_brook_trout_biomass, wild_brown_trout_biomass, wild_rainbow_trout_biomass), na.rm=T),
    
    date_correct = str_replace(date_added_to_class_a_list, "Febuary", "February"), # use str_replace to correct Febuary

    lower_limit_longitude = replace(lower_limit_longitude, lower_limit_longitude > 77.71, -77.71111), # replace errant longitude
    
    date = ymd(parse_date_time(date_correct, orders="bY")), # use parse_date to get into date format
    month = month(date), # extract month
    year = year(date) # extract year
  ) %>% 
  
  #remove another errant longitude that cannot be understood
  filter(!(lower_limit_longitude > -10)) %>% 
  
  #drop unnecesary columns
  select(-date_added_to_class_a_list)


The result is a clean dataframe

county water section fishery section_limits lower_limit_latitude lower_limit_longitude length_miles wild_brook_trout_biomass wild_brown_trout_biomass wild_rainbow_trout_biomass total_biomass date_correct date month year
Somerset South Fork Bens Creek 4 Brown 30 m downstream private bridge off T-590 to T-785 bridge 40.25059 -78.97206 3.51 NA 44.03 NA 44.03 July 2019 2019-07-01 7 2019
McKean UNT to Lewis Run (RM 3.20) 1 Brown Headwaters to Mouth 41.84286 -78.69338 2.68 1.29 40.62 NA 41.91 April 2019 2019-04-01 4 2019
Clearfield/ Elk Grapevine Run 1 Brook Headwaters to Mouth 41.20558 -78.66495 0.76 40.47 NA NA 40.47 April 2019 2019-04-01 4 2019


Since we now have a clean class A dataframe with lat/long, we can create an interactive map using the leaflet() package. This is a very simple process, but allows you to leverage basemaps and display info with pop-up labels that display while hovering/clicking. You can also add data filters via radio buttons that help to display categorical information spatially. Another advantage is these maps are easily embedded in Rmarkdown documents that can be published to the web, thus no need for IT involvement.

# leaflet map with radio buttons to toggle between species

#define legend categories
bins <- c(0, 20, 40, 60, 100, 200, Inf)
pal_bin <- colorBin("viridis", domain = classA_df$total_biomass, bins = bins, na.color = 'transparent')
# setting na.color = 'transparent' causes sites where no biomass data is present to become invisible

biomassBySpecies_map <-
  classA_df %>%
  
  leaflet() %>% 
  addProviderTiles(providers$Esri.NatGeoWorldMap) %>% 
  
  #species
  
  #brook
  addCircleMarkers(lng = ~lower_limit_longitude, lat = ~lower_limit_latitude,
                   col = ~pal_bin(wild_brook_trout_biomass),
                   opacity = 0.8, popup = ~ water, 
                   label = ~ as.character(round(wild_brook_trout_biomass, 2)), group = 'Brook') %>%
  
  #brown
  addCircleMarkers(lng = ~lower_limit_longitude, lat = ~lower_limit_latitude,
                   col = ~pal_bin(wild_brown_trout_biomass),
                   opacity = 0.8, popup = ~ water, 
                   label = ~ as.character(round(wild_brown_trout_biomass, 2)), group = 'Brown') %>%
  
  #rainbow
  addCircleMarkers(lng = ~lower_limit_longitude, lat = ~lower_limit_latitude,
                   col = ~pal_bin(wild_rainbow_trout_biomass),
                   opacity = 0.8, popup = ~ water, 
                   label = ~ as.character(round(wild_rainbow_trout_biomass, 2)), group = 'Rainbow') %>%
  
  #total
  addCircleMarkers(lng = ~lower_limit_longitude, lat = ~lower_limit_latitude,
                   col = ~pal_bin(total_biomass),
                   opacity = 0.8, popup = ~ water, 
                   label = ~ as.character(round(total_biomass, 2)), group = 'Total') %>%
  
  #legend -- for bin palette
  addLegend("topright", pal = pal_bin, values = ~total_biomass,
            title = "Trout Biomass",
            opacity = 0.8) %>% 
  
  
  # Layers control
  addLayersControl(
    overlayGroups = c("Brook", 'Brown', 'Rainbow', "Total"),
    options = layersControlOptions(collapsed = FALSE)
  )
  
biomassBySpecies_map


Now, since our intention is to examine the relationship of trout biomass and water quality, lets bring in the WQI dataset from TW (2008-2018). We again use read_excel() and janitor to clean column names.

wqi <- read_excel('WQI_2018.xlsx', sheet = 'Sheet1')
wqi <- janitor::clean_names(wqi) #clean column names


I want to show WQI data spatially with a leaflet map too, but first we have to deal with the classifications. These categorical ‘grades’ are problematic, so it’s best to make an additional column as a factor and set levels.

## make grade a factor variable for use with symbology below
wqi <- 
  wqi %>% 
  mutate(f_grade = factor(grade, levels = c("Poor", "Fair", "Average", "Good"))) 

levels(wqi$f_grade)
## [1] "Poor"    "Fair"    "Average" "Good"

Now we can code our leaflet map (just showing wqi, no swqi’s)

## define color palette
factpal <- colorFactor(c('#d7191c', '#fdae61', '#abdda4', '#2b83ba'), wqi$f_grade)

# overall WQI only
wqi_map <-
  wqi %>%
  
  leaflet() %>% 
  addProviderTiles(providers$Esri.NatGeoWorldMap) %>% 
  addCircleMarkers(
    col = ~factpal(f_grade),
    opacity = 0.8,
    popup = ~ as.character(round(wqi, 2)), label = ~ flowline_gnis) %>%
  addLegend(
    colors=c('#d7191c', '#fdae61', '#abdda4', '#2b83ba'),
    labels=c('Poor', 'Fair', 'Average', 'Good'),
    opacity = 0.8,
    title = 'DEP 2018 WQI Scores')

wqi_map


Now that we’ve visualized trout biomass and wqi throughout space, the next step is to export our clean dataframes to excel so we can merge with HUC12 watershed in GIS. This is an annoying step that requires multiple steps outside of the Rstudio project.

For example, it’s a pain to export to excel, import to GIS, spatial join, save as shapefile, open .dbf in excel, save, and re-import to R. Plus, I’m learning that DEP’s GIS only accepts .xls – while R exports as .xlsx. Need to find a work-around to eliminate yet another intermediate step.

For the long-term we need to find way to work with polygons/perfom GIS joins/relates in R. Maybe ArcPro would provide some less painful alternatives?

For the GIS task itself, you can actually build a spreadsheet [write_xlsx()] using dataframes in R. See example below using biomass and wqi dataframes. This helps prevent a bunch of individual csv files in your directory – too many and naming conventions become an issue.

## I don't want to take the entire dataframes out to merge with HUC12, GIS does strange things to numbers
## see if first 7 columns have enough unique info to complete merge

dim(classA_df) # 962  16

classA_df_merge <- 
  classA_df %>% 
  select(1:7) %>% # select only first 7 cols (includes lat/long)
  distinct() # determine how many unique combos of 1st 6 cols are present (hopefully 962!)

classA_df_merge
dim(classA_df_merge) # 962   7 -- check!


## same for wqi df

dim(wqi) #2625   17

wqi_merge <- 
  wqi %>% 
  select(1:6, 15:16) %>% # select only first 6 cols plus lat/long for projecting
  distinct() # determine how many unique combos of 1st 6 cols are present (hopefully 2625!)

wqi_merge
dim(wqi_merge) #2625    8 -- check!

# build spreadsheet/export dataframes as single excel sheet with 2 tabs for GIS merge with HUC12


write_xlsx(list(classA_df_merge = classA_df_merge, wqi_merge = wqi_merge), "gis_huc12.xlsx")

THe GIS merge happens in a black box outside of R.

I simply projected both dataframes as shapefiles, and merged with HUC12 polygon layer


Now we re-import the result as we did above

However, GIS chops off column names if they’re too long so we need to deal with that to merge with the original df to access the numeric fields of interest. Using piping we can read-in data, subset columns, rename columns, and join with original df all in one step.

classA_huc12_join <- read_excel('classA_huc12_join.xlsx', sheet = 'classA_huc12_join') %>% 
  
  select(2:6, 16:21) %>% # select only columns of interest
  
  rename(section_limits = section_li) %>% #rename columns that gis cut off
  
  left_join(classA_df, by=c("county", "water", "section", "fishery", "section_limits"))


We confirm that dfs merged successfully due to helpful messages from the tidylog package. When loaded after tidyverse this package gives narrations on what each function accomplished.

county water section fishery section_limits HUC_8 HUC_8_NAME HUC_10 HUC_10_NAM HUC_12 HUC_12_NAM lower_limit_latitude lower_limit_longitude length_miles wild_brook_trout_biomass wild_brown_trout_biomass wild_rainbow_trout_biomass total_biomass date_correct date month year
Lancaster Segloch Run 2 Brook Intersection of T 596 and T 548 to SR026 02050306 Lower Susquehanna 0205030609 Cocalico Creek 020503060902 Middle Creek 40.23230 -76.27900 2.60 27.35 29.95 NA 57.30 January 1983 1983-01-01 1 1983
Bedford-Blair Boiling Spring Run 2 Brown Dam at pond immediately downstream from Harvest Lane to Mouth 02050302 Upper Juniata 0205030201 Upper Frankstown Branch Juniata River 020503020101 Beaverdam Creek 40.27194 -78.46639 2.11 NA 56.21 NA 56.21 January 2017 2017-01-01 1 2017
Perry Kansas Valley Run 1 Mixed Brook/Brown Headwaters to Mouth 02050304 Lower Juniata 0205030409 Tuscarora Creek 020503040903 Horse Valley Run 40.34373 -77.59420 2.48 23.21 31.96 NA 55.17 March 2001 2001-03-01 3 2001


Same for the WQI - HUC12 df

wqi_huc12_join <- read_excel('wqi_huc12_join.xlsx', sheet = 'wqi_huc12_join') %>% 
  
  select(2:7, 17:22) %>% # select only columns of interest
  
  rename(nhd_location_id = nhd_locati, monitoring_alias = monitoring, 
         flowline_gnis = flowline_g, year_collected = year_colle) %>% #rename columns that gis cut off
  
  left_join(wqi, by=c("nhd_location_id", "monitori_1", "monitoring_alias", "sample_loc", "flowline_gnis", "year_collected"))


nhd_location_id monitori_1 monitoring_alias sample_loc flowline_gnis year_collected HUC_8 HUC_8_NAME HUC_10 HUC_10_NAM HUC_12 HUC_12_NAM wqi wq_similar forest urban ag land_dist grade n latitude longitude f_grade
1100734 NA NA Cedar Run at culvert outlet downstream of Waste Management NA 18 02050305 Lower Susquehanna-Swatara 0205030505 Yellow Breeches Creek 020503050505 Lower Yellow Breeches Creek 30.33704 Ag 33.33339 37.79162 30.33704 39.21016 Poor 1 40.23016 -76.95059 Poor
36455 NA WQN0212 OLD US RT 11; BRIDGE ST BR @ NEW CUMBERLAND Yellow Breeches Creek 8 02050305 Lower Susquehanna-Swatara 0205030505 Yellow Breeches Creek 020503050505 Lower Yellow Breeches Creek 49.34103 Forest 49.34103 50.26803 53.29568 76.61736 Fair 16 40.22430 -76.86069 Fair
36455 NA WQN0212 OLD US RT 11; BRIDGE ST BR @ NEW CUMBERLAND Yellow Breeches Creek 9 02050305 Lower Susquehanna-Swatara 0205030505 Yellow Breeches Creek 020503050505 Lower Yellow Breeches Creek 56.38481 Ag 57.43163 60.63640 56.38481 76.58827 Fair 19 40.22430 -76.86069 Fair

Now after all that wrangling, the analysis step becomes possible.

The analysis routine consists of averaging wqi in each huc12 watershed since we could potentially have multiple wqi points per huc 12. The resulting characterization of water quality in each watershed can serve as a predictor variable for trout biomass.

wqi_huc12_stat <- 
  wqi_huc12_join %>% 
  
  group_by(HUC_12, HUC_12_NAM) %>% # summarize by these fields
  
  summarize(mean_wqi = mean(wqi, na.rm=T), sd_wqi = sd(wqi, na.rm=T), n_wqi = length(wqi), # calculate stats
            minYear = min(year_collected, na.rm=T), maxYear = max(year_collected, na.rm=T)) 
# could keep going to calculate mean swqi scores


444 huc12s have wqi points within. Range of n is 1-37.

HUC_12 HUC_12_NAM mean_wqi sd_wqi n_wqi mean_forest_wqi sd_forest_wqi mean_urban_wqi sd_urban_wqi mean_ag_wqi sd_ag_wqi mean_land_dist_wqi sd_land_dist_wqi minYear maxYear
020401010402 Lower Equinunk Creek 93.47634 NA 1 97.07617 NA 98.40826 NA 93.51108 NA 93.47634 NA 18 18
020401010605 Masthope Creek 82.67053 NA 1 96.03777 NA 96.08027 NA 95.36778 NA 82.67053 NA 16 16
020401030201 East Branch Dyberry Creek 91.91476 1.170703 2 96.94153 2.389315 98.16760 1.569052 93.56753 3.508076 94.57599 2.592843 10 18


So water quality has been characterized (mean, sd wqi) for 444 huc12s

Now join with classA df using the HUC12 field to investigate how responsive trout biomass is to WQ

classA_wqi_df <- 
  classA_huc12_join %>% 
  
  inner_join(wqi_huc12_stat, by=c('HUC_12', 'HUC_12_NAM')) ## using inner_join so that only matching x & y records are returned


The merge resulted in 137 unique huc12s with water quality AND trout biomass data - decent n

Let’s plot the result, by species to visualize relationship between wqi (x) and trout biomass (y)

To accomplish, we need to melt and facet_wrap in ggplot2 to show x - y relationships. Piping like this - including melting and plotting results in no objects created in your local environment!

classA_wqi_df %>% 
  
  reshape2::melt(id.vars = c('HUC_12', 'HUC_12_NAM', 'mean_wqi', 'sd_wqi'),
                 measure.vars = c("wild_brook_trout_biomass", "wild_brown_trout_biomass", 
                                  "wild_rainbow_trout_biomass","total_biomass")) %>% 
  
  drop_na(value) %>% #drop the rows with NA value for trout biomass
  
  ggplot(aes(x=mean_wqi, y=value, color = variable, fill = variable)) +
  facet_wrap(~variable, ncol = 2, scales = 'free_y') +
  geom_point(pch=21, alpha = 0.7) +
  geom_smooth(method = lm) + 
  labs(x = 'Mean HUC12 WQI', y = 'Trout Biomass (kg/ha)') +
  theme_minimal() +
  theme(legend.position = 'none')


If we want to save high quality image file of plot, use ggsave() to manipulate file type, dpi, and dimensions

ggsave(file='WQI_vs_TroutBiomass.png', scale=1, units='in', dpi=300, width=10, height=6)





Here is a better representation of each species’ relationship to the overall WQI, and individual SWQI’s.



Subsequent regression modeling of brook and brown trout vs. water quality is revealing. Rainbow trout response was not modeled due to low sample size (n=14).

Species WQI Type Intercept Slope P R2_adjusted n Result
Brook Trout Agriculture SWQI 14.81 +/- 7.9 0.29 +/- 0.09 0.002 2.88% 307 Positive
Brook Trout Forest SWQI 14.78 +/- 9.09 0.29 +/- 0.1 0.006 2.11% 307 Positive
Brook Trout Land Disturbance SWQI 53.1 +/- 8.12 -0.16 +/- 0.1 0.092 0.60% 307 No Trend
Brook Trout Urban SWQI 16.35 +/- 10.07 0.27 +/- 0.11 0.021 1.42% 307 Positive
Brook Trout WQI 33.17 +/- 6.45 0.08 +/- 0.08 0.312 0.01% 307 No Trend
Brown Trout Agriculture SWQI 170.61 +/- 13.01 -1.5 +/- 0.17 0.000 26.38% 221 Negative
Brown Trout Forest SWQI 191.33 +/- 15.1 -1.7 +/- 0.19 0.000 26.78% 221 Negative
Brown Trout Land Disturbance SWQI 191.74 +/- 51.42 -1.54 +/- 0.6 0.010 2.53% 221 Negative
Brown Trout Urban SWQI 213.08 +/- 17.54 -1.92 +/- 0.21 0.000 26.58% 221 Negative
Brown Trout WQI 174.98 +/- 13.87 -1.65 +/- 0.19 0.000 25.20% 221 Negative


Brook trout biomass is significantly, positively correlated with agriculture, forest, and urban SWQIs. For example, for every 3 point increase in the agriculture SWQI score (improving water quality), brook trout biomass increased by ~ 1 kg/ha. This suggests that brook trout are responding positively to increases in water quality spatially across PA. Although these relationships are positive and significant, the models describe very little variation in the data (0 - 2.88%). The most likely interpretation is that there are many other stressors involved in controlling brook trout abundance such as habitat degradation and population fragmentation. Water quality has a role in regulating brook trout abundance, but it seems to be a minor factor.


Conversely, brown trout are negatively correlated with all SWQI scores and the overall WQI score. The slopes (effect sizes) are much greater for browns compared to brooks. For example, for every 1 point increase in overall WQI score, brown trout biomass decreases by 1.65 kg/ha. Additionally, the amount of variation described by brown trout models is much increased (25 - 27% for overall WQI and all SWQIs except land disturbance). Taken together, the magnitude and directionality of the slope estimates suggest that water quality is not playing a large role in regulating brown trout populations.


A shortcoming of this analysis may be that mean wqi scores in huc12 watersheds are not representative of on-site water quality. However, huc12s are small enough that it shouldn’t be a major issue.


I hope this exercise included some useful code while informing a current topic of discussion.